Introduction



Purpose 

This guide is designed to give you information about programming in the 
UNIX System V/386 operating system environment. It does not attempt to 
teach you how to write programs. Rather, it is intended to supplement texts 
on programming languages by concentrating on the other elements that are 
part of getting programs into operation. 

Audience and Prerequisite Knowledge 

As the title suggests, we are addressing programmers, especially those 
who have not worked extensively with the UNIX System. No special level of 
programming involvement is assumed. We hope the book will be useful to 
people who write only an occasional program, as well as to those who work 
on or manage large application development projects. 

Programmers in the expert class, or those engaged in developing system 
software, may find this guide lacks the depth of information they need. For 
them we recommend the Programmer's Reference Manual.

Knowledge of terminal use, of a UNIX System editor, and of the UNIX 
System directory/file structure is assumed. If you feel shaky about your 
mastery of these basic tools, you might want to look over the User's Guide 
before tackling this one. The material is organized into eighteen chapters. 



The C Connection 

The UNIX System supports many programming languages, and C com- 
pilers are available on many different operating systems. Nevertheless, the 
relationship between the UNIX Operating System and C has always been and 
remains very close. Most of the code in the UNIX Operating System is C, and 
over the years many organizations using the UNIX System have come to use 
C for an increasing portion of their application code. Thus, while this guide is 
intended to be useful to you no matter what language(s) you are using, you 
will find that, unless there is a specific language-dependent point to be made, 
the examples assume you are programming in C.

Hardware/Software Dependencies

The text reflects the way things work on your computer running UNIX 
System V/386 at the Release 3.2 level. If you find commands that work a lit- 
tle differently in your UNIX System environment, it may be because you are 
running under a different release of the software. If some commands do not 
seem to exist at all, they may be members of packages not installed on your 
system. Appendix A describes the command packages available on your com- 
puter. If you do find yourself trying to execute a non-existent command, 
check Appendix A, then talk to the administrators of your system. 

Notational Conventions 

Whenever the text includes examples of output from the computer and/or 
commands entered by you, we follow the standard notation scheme that is 
common throughout UNIX System documentation: 

■ Commands, options, arguments to commands, and directory and file 
names that you type in from your terminal are shown in bold type. 

■ Text that is printed on your terminal by the computer is shown in 
constant width type. Constant width type is also used for code sam- 
ples because it allows the most accurate representation of spacing. 
Spacing is often a matter of coding style but is sometimes critical. 

■ Comments added to a display to show that part of the display has 
been omitted are shown in italic type and are indented to separate 
them from the text that represents computer output or input. Com- 
ments that explain the input or output are shown in the same type font 
as the rest of the display. 

Italics are also used to show substitutable values, such as filename,
when the format of a command is shown. 

■ There is an implied RETURN at the end of each command and menu 
response you enter. Where you may be expected to enter only a 
RETURN (as in the case where you are accepting a menu default), the 
symbol <CR> is used.

■ In cases where you are expected to enter a control character, it is
shown as, for example, CTRL-D. This means that you press the d key 
on your keyboard while holding down the CTRL key. 

■ The dollar sign ($) and pound sign (#) are the standard default prompt
signs for an ordinary user and root. $ means you are logged in as an
ordinary user; # means you are logged in as root.

■ When the # prompt is used in an example, it means the command 
illustrated may be used only by root. 



Command References 

When commands are mentioned in a section of the text for the first time, a 
reference to the manual section where the command is formally described is 
included in parentheses, that is, command(section). Numbered sections are 
located in the following manuals: 

Sections (1, 1M), (7)             User's/System Administrator's Reference Manual

Sections (1), (2), (3), (4), (5)  Programmer's Reference Manual



Information in the Examples

While every effort has been made to present displays of information just 
as they appear on your terminal, it is possible that your system may produce 
slightly different output. Some displays depend on a particular machine con- 
figuration that may differ from yours. Changes between releases of the UNIX 
System software may cause small differences in what appears on your termi- 
nal. 

Where complete code samples are shown, we have tried to make sure 
they compile and work as represented. Where code fragments are shown, 
while we cannot say that they have been compiled, we have attempted to 
maintain the same standards of coding accuracy for them.

Introduction



The 1983 Turing Award of the Association for Computing Machinery was 
given jointly to Ken Thompson and Dennis Ritchie, the two men who first 
designed and developed the UNIX Operating System. The award citation 
said, in part: 

The success of the UNIX System stems from its tasteful selec- 
tion of a few key ideas and their elegant implementation. The 
model of the UNIX System has led a generation of software 
designers to new ways of thinking about programming. The 
genius of the UNIX System is its framework which enables 
programmers to stand on the work of others. 

As programmers working in a UNIX System environment, why should we 
care what Thompson and Ritchie did? Does it have any relevance for us 
today? 

It does, because understanding the thinking behind the system design
and the atmosphere in which it flowered can help us become productive
UNIX System programmers more quickly.

The Early Days 

You may already have read about how Ken Thompson came across a DEC 
PDP-7 machine sitting unused in a hallway at AT&T Bell Laboratories, and 
how he and Dennis Ritchie and a few of their colleagues used that as the ori- 
ginal machine for developing a new operating system that became UNIX. 

The important thing to realize, however, is that what they were trying to 
do was fashion a pleasant computing environment for themselves. It was not, 
"Let's get together and build an operating system that will attract world-wide 
attention."

The sequence in which elements of the system fell into place is interest- 
ing. The first piece was the file system, followed quickly by its organization 
into a hierarchy of directories and files. The view of everything, data stores, 
programs, commands, directories, even devices, as files of one type or another 
was critical, as was the idea of a file as a one-dimensional array of bytes with 
no other structure implied. The cleanness and simplicity of this way of look- 
ing at files has been a major contributing factor to a computer environment 
that programmers and other users have found comfortable.

The next element was the idea of processes, with one process being able
to create another and communicate with it. This innovative way of looking at 
running programs as processes led easily to the practice (quintessentially 
UNIX System) of reusing code by calling it from another process. With the 
addition of commands to manipulate files and an assembler to produce exe- 
cutable programs, the system was essentially able to function on its own. 

The next major development was the acquisition of a DEC PDP-11 and 
the installation of the new system on it. This has been described by Ritchie as 
a stroke of good luck, in that the PDP-11 was to become a hugely successful 
machine, its success to some extent adding momentum to the acceptance of 
the system that began to be known by the name of UNIX System. 

By 1972 the innovative idea of pipes (connecting links between processes 
whereby the output of one becomes the input of the next) had been incor- 
porated into the system, the operating system had been recoded in higher 
level languages (first B, then C) and had been dubbed with the name UNIX 
System (coined by Brian Kernighan). By this point, the "pleasant computing
environment" sought by Thompson and Ritchie was a reality; but some other 
things were going on that had a strong influence on the character of the pro- 
duct then and today. 

It is worth pointing out that the UNIX System came out of an atmosphere 
that was totally different from that in which most commercially successful 
operating systems are produced. The more typical atmosphere is that 
described by Tracy Kidder in The Soul of a New Machine. In that case, dozens
of talented programmers worked at white heat, in an atmosphere of extremely 
tight security, against murderous deadlines. By contrast, the UNIX System 
could be said to have had about a ten-year gestation period. From the begin- 
ning it attracted the interest of a growing number of brilliant specialists, many 
of whom found in the UNIX System an environment that allowed them to 
pursue research and development interests of their own, but who, in turn, 
contributed additions to the body of tools available for succeeding ranks of 
UNIX System programmers. 

Beginning in 1971, the system began to be used for applications within 
AT&T Bell Laboratories, and shortly thereafter (1974) was made available at 
low cost and without support to colleges and universities. These versions,
called research versions and identified with Arabic numbers up through 7, 
occasionally grew on their own and fed back to the main system additional 
innovative tools. The widely-used screen editor vi(1), for example, was added
to the UNIX System by William Joy at the University of California, Berkeley. 
In 1979, acceding to commercial demand, AT&T began offering supported
versions (called development versions) of the UNIX System. These are
identified with Roman numerals and often have interim release numbers appended.
The current development version, for example, is UNIX System V/386 
Release 3.2. 

Versions of the UNIX System being offered now are coming from an 
environment more closely related, perhaps, to the standard software factory. 
Features are being added to new releases in response to the expressed needs 
of the market place. The essential quality of the UNIX System, however, 
remains as the product of the innovative thinking of its originators and the 
collegial atmosphere in which they worked. This quality has on occasion
been referred to as the UNIX System philosophy, but what is meant is the 
way in which sophisticated programmers have come to work with the UNIX 
System. 



UNIX System Philosophy Simply Stated 

For as long as you are writing programs on a UNIX System you should 
keep this motto hanging on your wall: 

***********************************
*                                 *
*   Build on the work of others   *
*                                 *
***********************************

Unlike computer environments where each new project is like starting 
with a blank canvas, on a UNIX System a good percentage of any programming
effort is lying there in bins, and lbins, and /usr/bins, not to mention
etc, waiting to be used. 

The features of the UNIX System (pipes, processes, and the file system) 
contribute to this reusability, as does the history of sharing and contributing 
that extends back to 1969. You risk missing the essential nature of the UNIX 
System if you do not put this to work. 



UNIX System Tools and Where You Can 
Read About Them 

The term "UNIX System tools" can stand some clarification. In the nar- 
rowest sense, it means an existing piece of software used as a component in a 
new task. In a broader context, the term is often used to refer to elements of 
the UNIX System that might also be called features, utilities, programs, filters, 
commands, languages, functions, and so on. It gets confusing because any of 
the things that might be called by one or more of these names can be, and 
often are, used in the narrow way as part of the solution to a programming 
problem. 

Tools Covered and Not Covered in this Guide 

The Programmer's Guide is about tools used in the process of creating pro- 
grams in a UNIX System environment, so let us take a minute to talk about 
which tools we mean, which ones are not going to be covered in this book, 
and where you might find information about those not covered here. Actu- 
ally, the subject of things not covered in this guide might be even more 
important to you than the things that are. We could not possibly cover every- 
thing you ever need to know about UNIX System tools in this one volume. 

Tools not covered in this text: 

■ the login procedure 

■ UNIX System editors and how to use them 

■ how the file system is organized and how you move around in it 

■ shell programming 

Information about these subjects can be found in the User's Guide and a 
number of commercially available texts. 

Tools covered here can be classified as follows: 

■ utilities for getting programs running 

■ utilities for organizing software development projects 

■ specialized languages

■ debugging and analyzing tools

■ compiled language components that are not part of the language syntax,
for example, standard libraries, system calls, and functions



The Shell as a Prototyping Tool 

Any time you log in to a UNIX System machine you are using the shell. 
The shell is the interactive command interpreter that stands between you and 
the UNIX System kernel, but that is only part of the story. Because of its abil- 
ity to start processes, direct the flow of control, field interrupts, and redirect 
input and output, it is a full-fledged programming language. Programs that 
use these capabilities are known as shell procedures or shell scripts. 

Much innovative use of the shell involves stringing together commands to 
be run under the control of a shell script. The dozens and dozens of com- 
mands that can be used in this way are documented in the User's/System 
Administrator's Reference Manual. Time spent with the User's/System 
Administrator's Reference Manual can be rewarding. Look through it when you 
are trying to find a command with just the right option to handle a knotty 
programming problem. The more familiar you become with the commands 
described in the manual pages, the more you will be able to take full advan- 
tage of the UNIX System environment. 

It is not our purpose here to instruct you in shell programming. What we 
want to stress here is the important part that shell procedures can play in 
developing prototypes of full-scale applications. While understanding all the
nuances of shell programming can be a fairly complex task, getting a shell 
procedure up and running is far less time-consuming than writing, compiling, 
and debugging compiled code. 

This ability to get a program into production quickly is what makes the 
shell a valuable tool for program development. Shell programming allows 
you to "build on the work of others" to the greatest possible degree, since it 
allows you to piece together major components simply and efficiently. Many 
times even large applications can be done using shell procedures. Even if the 
application is initially developed as a prototype system for testing purposes 
rather than being put into production, many months of work can be saved.

With a prototype for testing, the range of possible user errors can be
determined — something that is not always easy to plan out when an applica- 
tion is being designed. The method of dealing with strange user input can be 
worked out inexpensively, avoiding large re-coding problems.

A common occurrence in the UNIX System environment is to find that an 
available UNIX System tool can accomplish with a couple of lines of instruc- 
tions what might take a page and a half of compiled code. Shell procedures 
can intermix compiled modules and regular UNIX System commands to let 
you take advantage of work that has gone before.

We distinguish among three programming environments to emphasize
that the information needs and the way in which UNIX System tools are used 
differ from one environment to another. We do not intend to imply a hierar- 
chy of skill or experience. Highly-skilled programmers with years of experi- 
ence can be found in the "single-user" category, and relative newcomers can 
be members of an application development or systems programming team. 

Single-User Programmer 

Programmers in this environment are writing programs only to ease the 
performance of their primary job. The resulting programs might well be 
added to the stock of programs available to the community in which the pro- 
grammer works. This is similar to the atmosphere in which the UNIX System 
thrived; someone develops a useful tool and shares it with the rest of the 
organization. Single-user programmers may not have externally imposed 
requirements, or co-authors, or project management concerns. The programming
task itself drives the coding very directly. One advantage of a timesharing
system such as the UNIX System is that people with programming skills can
be set free to work on their own without having to go through formal project 
approval channels and perhaps wait for months for a programming depart- 
ment to solve their problems. 

Single-user programmers need to know how to do the following: 

■ select an appropriate language 

■ compile and run programs 

■ use system libraries 

■ analyze programs 

■ debug programs 

■ keep track of program versions 

Most of the information to perform these functions at the single-user level 
can be found in Chapter 2.

Application Programming

Programmers working in this environment are developing systems for the 
benefit of other, non-programming users. Most large commercial computer 
applications still involve a team of applications development programmers. 
They may be employees of the end-user organization, or they may work for a 
software development firm. Some of the people working in this environment 
may be more in the project management area than working programmers. 

Information needs of people in this environment include all the topics in 
Chapter 2, plus additional information on the following: 

■ software control systems 

■ file and record locking 

■ communication between processes 

■ shared memory 

■ advanced debugging techniques

These topics are discussed in Chapter 3.



Systems Programmers 

These are programmers engaged in writing software tools that are part of, 
or closely related to, the operating system itself. The project may involve 
writing a new device driver, a database management system, or an enhance- 
ment to the UNIX System kernel. In addition to knowing their way around 
the operating system source code and how to make changes and enhance- 
ments to it, they need to be thoroughly familiar with all the topics covered in 
Chapters 2 and 3.

Summary

In this overview chapter we have described the way that the UNIX Sys- 
tem developed and the effect that has on the way programmers now work 
with it. We have described what is and is not to be found in the other 
chapters of this guide to help programmers. We have also suggested that in 
many cases programming problems may be easily solved by taking advantage 
of the UNIX System interactive command interpreter known as the shell. 
Finally, we identified three programming environments, in the hope that it 
will help orient the reader to the organization of the text in the remaining 
chapters.

Introduction



The information in this chapter is for anyone just learning to write pro- 
grams to run in a UNIX System environment. In Chapter 1 we identified one 
group of UNIX System users as single-user programmers. People in that 
category, particularly those who are not deeply interested in programming, 
may find that this chapter (plus related reference manuals) tells them as much 
as they need to know about coding and running programs on a UNIX System 
computer. 

Programmers whose interest does run deeper, who are part of an applica- 
tion development project, or who are producing programs on one UNIX Sys- 
tem computer that are being ported to another, should view this chapter as a 
starter package.

Language Selection

How do you decide which programming language to use in a given situation?
One answer could be, "I always code in HAIRBOL, because that's the
language I know best." Actually, in some circumstances that is a legitimate
answer. But, assuming more than one programming language is available to 
you, that different programming languages have their strengths and 
weaknesses, and assuming that once you have learned to use one program- 
ming language it becomes relatively easy to learn to use another, you might 
approach the problem of language selection by asking yourself questions like 
the following: 

■ What is the nature of the task this program is to do? 

Does the task call for the development of a complex algorithm, or is 
this a simple procedure that has to be done on a lot of records? 

■ Does the programming task have many separate parts? 

Can the program be subdivided into separately compilable functions, 
or is it one module? 

■ How soon does the program have to be available? 

Is it needed right now, or do I have enough time to work out the most 
efficient process possible? 

■ What is the scope of its use? 

Am I the only person who will use this program, or is it going to be 
distributed to the whole world? 

■ Is there a possibility the program will be ported to other systems? 

■ What is the life expectancy of the program? 

Is it going to be used just a few times, or will it still be going strong 
five years from now? 



Supported Languages in a UNIX System 
Environment 

By "supported languages" we mean those offered by AT&T for use on 
your computer running UNIX System V/386 Release 3.2. Since these are 
separately purchasable items, not all of them will necessarily be installed on
your machine. On the other hand, you may have languages available on your
machine that came from another source and are not mentioned in this discus- 
sion. Be that as it may, in this section and the one to follow we give brief 
descriptions of the nature of (a) the C programming language, and (b) a 
number of special purpose languages. 

C Language 

The C language is intimately associated with the UNIX System, since it 
was originally developed for use in recoding the UNIX System kernel. If you 
need to use a lot of UNIX System function calls for low-level I/O, memory or 
device management, or inter-process communication, C language is a logical 
first choice. Most programs, however, do not require such direct interfaces 
with the operating system, so the decision to choose C might better be based 
on one or more of the following characteristics: 

■ a variety of data types: character, integer, long integer, float, and 
double 

■ low-level constructs (most of the UNIX System kernel is written in C) 

■ derived data types such as arrays, functions, pointers, structures, and 
unions 

■ multi-dimensional arrays 

■ scaled pointers and the ability to do pointer arithmetic 

■ bit-wise operators 

■ a variety of flow-of-control statements: if, if-else, switch, while, do- 
while, and for 

■ a high degree of portability 

C is a language that lends itself readily to structured programming. It is 
natural in C to think in terms of functions. The next logical step is to view 
each function as a separately compilable unit. This approach (coding a pro- 
gram in small pieces) eases the job of making changes and/or improvements. 
If this begins to sound like the UNIX System philosophy of building new pro- 
grams from existing tools, it is not just coincidence. As you create functions 
for one program, you will surely find that many can be picked up or quickly 
revised for another program.

A difficulty with C is that it takes a fairly concentrated use of the language
over a period of several months to reach your full potential as a C program- 
mer. If you are a casual programmer, you might make life easier for yourself 
if you choose a less demanding language. 

Assembly Language 

The closest approach to machine language, assembly language is specific 
to the particular computer on which your program is to run. High-level 
languages are translated into the assembly language for a specific processor as 
one step of the compilation. The most common need to work in assembly 
language arises when you want to do some task that is not within the scope 
of a high-level language. Since assembly language is machine-specific,
programs written in it are not portable.



Special Purpose Languages 

In addition to the above formal programming languages, the UNIX System 
environment frequently offers one or more of the special purpose languages 
listed below. 



NOTE 



Since UNIX System utilities and commands are packaged in functional 
groupings, it is possible that not all the facilities mentioned will be available 
on all systems. 



awk 

awk (its name is an acronym constructed from the initials of its developers)
scans an input file for lines that match pattern(s) described in a specification
file. On finding a line that matches a pattern, awk performs actions also
described in the specification. It is not uncommon that an awk program can 
be written in a couple of lines to do functions that would take a couple of 
pages to describe in a programming language like FORTRAN or C. For exam- 
ple, consider a case where you have a set of records that consist of a key field 
and a second field that represents a quantity. You have sorted the records by 
the key field, and you now want to add the quantities for records with dupli- 
cate keys and output a file in which no keys are duplicated. 



The pseudo-code for such a program might look like this: 

Read the first record into a hold area;
Read additional records until EOF;
{
    If the key matches the key of the record in the hold area,
        add the quantity to the quantity field of the held record;
    If the key does not match the key of the held record,
        write the held record,
        move the new record to the hold area;
}
At EOF, write out the last record from the hold area.

An awk program to accomplish this task would look like this: 

{ qty[$1] += $2 }
END { for (key in qty) print key, qty[key] }

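This illustrates only one characteristic of awk: its ability to work with
associative arrays. With awk, the input file does not have to be sorted, which is a
requirement of the pseudo-program. The order in which the keys are printed,
however, is not defined by awk; pipe the output through sort if order matters.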

lex 

lex is a lexical analyzer that can be added to C programs. A lexical 
analyzer is interested in the vocabulary of a language rather than its grammar, 
which is a system of rules defining the structure of a language. lex can produce
C language subroutines that recognize regular expressions specified by
the user, take some action when a regular expression is recognized, and pass 
the output stream on to the next program. 

yacc 

yacc (Yet Another Compiler Compiler) is a tool for describing an input 
language to a computer program. yacc produces a C language subroutine that
parses an input stream according to rules laid down in a specification file.
The yacc specification file establishes a set of grammar rules together with
actions to be taken when tokens in the input match the rules. lex may be
used with yacc to control the input process and pass tokens to the parser that 
applies the grammar rules.

m4

m4 is a macro processor that can be used as a preprocessor for assembly 
language and C programs. It is described in Section (1) of the Programmer's 
Reference Manual.

bc and dc

bc enables you to use a computer terminal as you would a programmable
calculator. You can edit a file of mathematical computations and call bc to
execute them. The bc program uses dc. You can use dc directly if you want,
but it takes a little getting used to since it works with reverse Polish notation.
That means you enter numbers into a stack followed by the operator. bc and
dc are described in Section (1) of the User's/System Administrator's Reference
Manual.

curses 

Actually a library of C functions, curses is included in this list because the 
set of functions just about amounts to a sub-language for dealing with termi- 
nal screens. If you are writing programs that include interactive user screens, 
you will want to become familiar with this group of functions. 

In addition to all the foregoing, do not overlook the possibility of using 
shell procedures.

The last two steps in most compilation systems in the UNIX System
environment are the assembler and the link editor. The compilation system 
produces assembly language code. The assembler translates that code into the 
machine language of the computer the program is to run on. The link editor 
resolves all undefined references and makes the object module executable. 
With most languages on the UNIX System the assembler and link editor pro- 
duce files in what is known as the Common Object File Format (COFF). A 
common format makes it easier for utilities that depend on information in the 
object file to work on different machines running different versions of the 
UNIX System. 

In the Common Object File Format an object file contains the following: 

■ a file header 

■ optional secondary header 

■ a table of section headers 

■ data corresponding to the section header(s) 

■ relocation information 

■ line numbers 

■ a symbol table 

■ a string table 

An object file is made up of sections. Usually, there are at least two: 
.text, and .data. Some object files contain a section called .bss. (.bss is an 
assembly language pseudo-op that originally stood for "block started by sym- 
bol.") Options of the compilers cause different items of information to be 
included in the Common Object File Format. For example, compiling a 
program with the -g option adds line numbers and other symbolic informa- 
tion that is needed for the sdb (Symbolic Debugger) command to be fully 
effective. You can spend many years programming without having to worry 
too much about the contents and organization of the Common Object File 
Format, so we are not going into any further depth of detail at this point. 
Detailed information is available in Chapter 11 of this guide. 

Compiling and Link Editing

The command used for compiling depends on the language used; for C 
programs, cc both compiles and link edits. 

Compiling C Programs 

To use the C compilation system you must have your source code in a file 
with a file name that ends in the characters .c, as in mycode.c. The command 
to invoke the compiler is 

cc mycode.c 

If the compilation is successful, the process proceeds through the link edit 
stage, and the result will be an executable file by the name of a.out. 

Several options to the cc command are available to control its operation. 
The most used options are: 

-c           causes the compilation system to suppress the link edit
             phase. This produces an object file (mycode.o) that can be
             link edited at a later time with a cc command without the
             -c option.

-g           causes the compilation system to generate special information
             about variables and language statements used by the
             symbolic debugger sdb. If you are going through the stage
             of debugging your program, use this option.

-O           causes the inclusion of an additional optimization phase.
             This option is logically incompatible with the -g option.
             You would normally use -O after the program has been
             debugged, to reduce the size of the object file and increase
             execution speed.

-p           causes the compilation system to produce code that works
             in conjunction with the prof(1) command to produce a
             runtime profile of where the program is spending its time.
             See the Programmer's Reference Manual for the prof(1)
             manual page. This is useful in identifying which routines
             are candidates for improved code.

-o outfile   tells cc to tell the link editor to use the specified name for
             the executable file, rather than the default a.out.

Other options can be used with cc. Check the Programmer's Reference
Manual.

If you enter the cc command using a file name that ends in .s, the compi- 
lation system treats it as assembly language source code and bypasses all the 
steps ahead of the assembly step. 

Compiler Diagnostic Messages

The C compiler generates error messages for statements that do not com- 
pile. The messages are generally quite understandable, but in common with 
most language compilers they sometimes point several statements beyond 
where the actual error occurred. For example, if you inadvertently put an 
extra ; at the end of an if statement, a subsequent else will be flagged as a 
syntax error. In the case where a block of several statements follows the if, 
the line number of the syntax error caused by the else will start you looking 
for the error well past where it is. Unbalanced curly braces, { }, are another 
common producer of syntax errors. 

Link Editing

The ld command invokes the link editor directly. The typical user, how-
ever, seldom invokes ld directly. A more common practice is to use a
language compilation control command (such as cc) that invokes ld. The link
editor combines several object files into one, performs relocation, resolves 
external symbols, incorporates startup routines, and supports symbol table 
information used by sdb. You may, of course, start with a single object file 
rather than several. The resulting executable module is left in a file named 
a.out. 

Any file named on the ld command line that is not an object file (typi-
cally, a name ending in .o) is assumed to be an archive library or a file of link
editor directives. The ld command has some 16 options. We are going to
describe four of them. These options should be fed to the link editor by speci-
fying them on the cc command line if you are doing both jobs with the single
command, which is the usual case.

-o outfile   provides a name to be used to replace a.out as the name of
             the output file. Obviously, the name a.out is of only
             temporary usefulness. If you know the name you want to use
             to invoke your program, you can provide it here. Of
             course, it may be equally convenient to do this:

                 mv a.out progname

             when you want to give your program a less temporary
             name.

-lx          directs the link editor to search a library libx.a, where x is
             up to nine characters. For C programs, libc.a is automatically
             searched if the cc command is used. The -lx option is
             used to bring in libraries not normally in the search path
             such as libm.a, the math library. The -lx option can occur
             more than once on a command line, with different values
             for the x. A library is searched when its name is encountered,
             so the placement of the option on the command line
             is important. The safest place to put it is at the end of the
             command line. The -lx option is related to the -L option.

-L dir       changes the libx.a search sequence to search in the specified
             directory before looking in the default library directories,
             usually /lib or /usr/lib. This is useful if you have
             different versions of a library, and you want to point the
             link editor to the correct one. It works on the assumption
             that once a library has been found no further searching for
             that library is necessary. Because -L diverts the search for
             the libraries specified by -lx options, it must precede such
             options on the command line.

-u symname   enters symname as an undefined symbol in the symbol
             table. This is useful if you are loading entirely from an
             archive library, because initially the symbol table is empty
             and needs an unresolved reference to force the loading of
             the first routine.



When the link editor is called through cc, a startup routine (typically 
/lib/crt0.o for C programs) is linked with your program. This routine calls
exit(2) after execution of the main program. 

The link editor accepts a file containing link editor directives. The details 
of the link editor command language can be found in Chapter 12. 

Language and the UNIX System

When a program is run in a computer, it depends on the operating system 
for a variety of services. Some of the services, such as bringing the program 
into main memory and starting the execution, are completely transparent to 
the program. They are, in effect, arranged for in advance by the link editor 
when it marks an object module as executable. As a programmer you seldom 
need to be concerned about such matters.

Other services, however, such as input/output (I/O), file management, 
and storage allocation require work on the part of the programmer. These 
connections between a program and the UNIX Operating System are what is 
meant by the term UNIX System/language interface. The following topics are 
included in this section: 

■ why C is used to illustrate the interface 

■ how arguments are passed to a program 

■ system calls and subroutines 

■ header files and libraries 

■ object file libraries 

■ input/output 

■ system calls for environment or status information 

■ processes 

■ error handling, signals, and interrupts 



Why C Is Used to Illustrate the Interface 

Throughout this section C programs are used to illustrate the interface 
between the UNIX System and programming languages, because C programs 
make more use of the interface mechanisms than other high-level languages. 
What is really being covered in this section then is the UNIX System/C 
Language interface. The way that other languages deal with these topics is 
described in the user's guides for those languages. 

How Arguments Are Passed to a Program

Information or control data can be passed to a C program as arguments on 
the command line. When the program is run as a command, arguments on 
the command line are made available to the function main in two parameters, 
an argument count and an array of pointers to character strings. (Every C
program is required to have an entry module by the name of main.) Since 
the argument count is always given, the program does not have to know in 
advance how many arguments to expect. The character strings pointed at by 
elements of the array of pointers contain the argument information. 

The arguments are presented to the program traditionally as argc and 
argv, although any names you choose will work. argc is an integer that gives
the count of the number of arguments. Since the command itself is con- 
sidered to be the first argument, argv[0], the count is always at least one. 
argv is an array of pointers to character strings (arrays of characters ter- 
minated by the null character \0). 

If you plan to pass runtime parameters to your program, you need to 
include code to deal with the information. Two possible uses of runtime 
parameters are the following: 

■ as control data. Use the information to set internal flags that control 
the operation of the program. 

■ to provide a variable file name to the program. 

Figures 2-1 and 2-2 show program fragments that illustrate these uses. 

#include <stdio.h>

#define FALSE 0

main(argc, argv)
int argc;
char *argv[];
{
    void exit();
    int oflag = FALSE;
    int pflag = FALSE;      /* Function Flags */
    int rflag = FALSE;
    int ch;

    while ((ch = getopt(argc, argv, "opr")) != EOF)
    {
        /* For options present, set flag to TRUE */
        /* If an unknown option is present, print error message */
        switch (ch)
        {
        case 'o':
            oflag = 1;
            break;
        case 'p':
            pflag = 1;
            break;
        case 'r':
            rflag = 1;
            break;
        default:
            (void) fprintf(stderr,
                "Usage: %s [-opr]\n", argv[0]);
            exit(2);
        }
    }
}

Figure 2-1: Using Command Line Arguments to Set Flags

#include <stdio.h>

main(argc, argv)
int argc;
char *argv[];
{
    FILE *fopen(), *fin;
    void perror(), exit();

    if (argc > 1)
    {
        if ((fin = fopen(argv[1], "r")) == NULL)
        {
            /* First string (%s) is program name (argv[0]) */
            /* Second string (%s) is name of file that could */
            /* not be opened (argv[1]) */

            (void) fprintf(stderr,
                "%s: cannot open %s: ",
                argv[0], argv[1]);
            perror("");
            exit(2);
        }
    }
}

Figure 2-2: Using argv[n] Pointers to Pass a File Name

The shell, which makes arguments available to your program, considers
an argument to be any sequence of non-blank characters delimited by blanks
or tabs. Characters enclosed in double quotes ("abc def") are passed to the
program as one argument, even if blanks or tabs are among the characters.
You are responsible for error checking and otherwise making sure the
argument received is what your program expects it to be.

A third argument is also present, in addition to argc and argv. The third
argument, known as envp, is an array of pointers to environment variables. 
You can find more information on envp in the Programmer's Reference Manual 
under exec(2) and environ(5). 



System Calls and Subroutines 

System calls are requests from a program for an action to be performed by 
the UNIX System kernel. Subroutines are precoded modules used to supple- 
ment the functionality of a programming language. 

Both system calls and subroutines look like functions such as those you 
might code for the individual parts of your program. There are, however, 
differences between them: 

■ At link edit time, the code for subroutines is copied into the object file 
for your program; the code invoked by a system call remains in the 
kernel. 

■ At execution time, subroutine code is executed as if it was code you 
had written yourself; a system function call is executed by switching 
from your process area to the kernel. 

This means that while subroutines make your executable object file larger, 
runtime overhead for context switching may be less and execution may be 
faster. 

Categories of System Calls and Subroutines

System calls divide fairly neatly into the following categories: 

■ file access 

■ file and directory manipulation 

■ process control 

■ environment control and status information 

You can generally tell the category of a subroutine by the section of the 
Programmer's Reference Manual in which you find its manual page. However, 
the first part of Section 3 (3C and 3S) covers such a variety of subroutines it 
might be helpful to classify them further. 

The subroutines of sub-class 3S constitute the UNIX System/C
Language standard I/O, an efficient I/O buffering scheme for C.

The subroutines of sub-class 3C do a variety of tasks. They have in
common the fact that their object code is stored in libc.a. They can be
divided into the following categories:

□ string manipulation 

□ character conversion 

□ character classification 

□ environment management 

□ memory management 

Figure 2-3 lists the functions that compose the standard I/O subroutines. 
Frequently, one manual page describes several related functions. The left 
hand column contains the name that appears at the top of the manual page; 
the other names in the same row are related functions described on the same 
manual page. 

Figure 2-4 lists string-handling functions that are grouped under the head-
ing string(3C) in the Programmer's Reference Manual.

Figure 2-5 lists macros that classify ASCII character-coded integer values. 
These macros are described under the heading ctype(3C) in Section 3 of the 
Programmer's Reference Manual.

Figure 2-6 lists functions and macros that are used to convert characters, 
integers, or strings from one representation to another. 

Function Name(s)                        Purpose

ferror   feof      clearerr  fileno     stream status inquiries
fopen    freopen   fdopen               open a stream
fread    fwrite                         binary input/output
fseek    rewind    ftell                reposition a file pointer in a
                                        stream
getc     getchar   fgetc     getw       get a character or word from a
                                        stream
gets     fgets                          get a string from a stream
popen    pclose                         begin or end a pipe to/from a
                                        process
printf   fprintf   sprintf              print formatted output
putc     putchar   fputc     putw       put a character or word on a
                                        stream
puts     fputs                          put a string on a stream
scanf    fscanf    sscanf               convert formatted input
setbuf   setvbuf                        assign buffering to a stream
system                                  issue a command through the shell
tmpfile                                 create a temporary file
tmpnam   tempnam                        create a name for a temporary file
ungetc                                  push character back into input
                                        stream
vprintf  vfprintf  vsprintf             print formatted output of a
                                        varargs argument list

For all functions: #include <stdio.h>

The function name shown in column 1 (for example, ferror) gives the
location in the Programmer's Reference Manual, Section 3.

Figure 2-3: C Language Standard I/O Subroutines

String Operations

strcat(s1, s2)       append a copy of s2 to the end of s1.

strncat(s1, s2, n)   append at most n characters from s2 to the end of s1.

strcmp(s1, s2)       compare two strings. Returns an integer less than,
                     greater than, or equal to 0 to show that s1 is
                     lexicographically less than, greater than, or equal
                     to s2.

strncmp(s1, s2, n)   compare at most n characters from the two strings.
                     Results are otherwise identical to strcmp.

strcpy(s1, s2)       copy s2 to s1, stopping after the null character (\0)
                     has been copied.

strncpy(s1, s2, n)   copy n characters from s2 to s1. s2 will be truncated
                     if it is longer than n, or padded with null characters
                     if it is shorter than n.

strdup(s)            returns a pointer to a new string that is a duplicate
                     of the string pointed to by s.

strchr(s, c)         returns a pointer to the first occurrence of character
                     c in string s, or a NULL pointer if c is not in s.

strrchr(s, c)        returns a pointer to the last occurrence of character
                     c in string s, or a NULL pointer if c is not in s.

strlen(s)            returns the number of characters in s up to the first
                     null character.

strpbrk(s1, s2)      returns a pointer to the first occurrence in s1 of any
                     character from s2, or a NULL pointer if no character
                     from s2 occurs in s1.

strspn(s1, s2)       returns the length of the initial segment of s1, which
                     consists entirely of characters from s2.

strcspn(s1, s2)      returns the length of the initial segment of s1, which
                     consists entirely of characters not from s2.

strtok(s1, s2)       look for tokens in s1, where tokens are delimited by
                     characters from s2.

For all functions: #include <string.h>

string.h provides extern definitions of the string functions.

Figure 2-4: String Operations






The Interface Between a Programming Language and the UNIX System 



Classify Characters

isalpha(c)    is c a letter

isupper(c)    is c an uppercase letter

islower(c)    is c a lowercase letter

isdigit(c)    is c a digit [0-9]

isxdigit(c)   is c a hexadecimal digit [0-9], [A-F] or [a-f]

isalnum(c)    is c an alphanumeric (letter or digit)

isspace(c)    is c a space, tab, carriage return, new-line, vertical tab,
              or form-feed

ispunct(c)    is c a punctuation character (neither control nor
              alphanumeric)

isprint(c)    is c a printing character, code 040 (space) through 0176
              (tilde)

isgraph(c)    same as isprint except false for 040 (space)

iscntrl(c)    is c a control character (less than 040) or a delete
              character (0177)

isascii(c)    is c an ASCII character (code less than 0200)

For all functions: #include <ctype.h>
Nonzero return == true; zero return == false

Figure 2-5: Classifying ASCII Character-Coded Integer Values

Function Name(s)             Purpose

a64l     l64a                convert between long integer and
                             base-64 ASCII string

ecvt     fcvt     gcvt       convert floating-point number to string

l3tol    ltol3               convert between 3-byte integer and
                             long integer

strtod   atof                convert string to double-precision
                             number

strtol   atol     atoi       convert string to integer

conv(3C):                    Translate Characters

toupper                      lowercase to uppercase

_toupper                     macro version of toupper

tolower                      uppercase to lowercase

_tolower                     macro version of tolower

toascii                      turn off all bits that are not part of a
                             standard ASCII character; intended for
                             compatibility with other systems

For all conv(3C) macros: #include <ctype.h>

Figure 2-6: Conversion Functions and Macros

Where the Manual Pages Can Be Found

System calls are listed alphabetically in Section 2 of the Programmer's
Reference Manual. Subroutines are listed in Section 3. We have described ear-
lier what is in the first subsection of Section 3. The remaining subsections of
Section 3 are:

■ 3M — functions that make up the Math Library, libm 

■ 3X — various specialized functions 

■ 3N — Networking Support Utilities 

How System Calls and Subroutines Are Used in C 
Programs 

Information about the proper way to use system calls and subroutines is 
given on the manual page, but you have to know what you are looking for 
before it begins to make sense. To illustrate, a typical manual page [for
gets(3S)] is shown in Figure 2-7.

NAME
     gets, fgets - get a string from a stream

SYNOPSIS
     #include <stdio.h>

     char *gets (s)
     char *s;

     char *fgets (s, n, stream)
     char *s;
     int n;
     FILE *stream;

DESCRIPTION
     Gets reads characters from the standard input stream, stdin, into the
     array pointed to by s, until a new-line character is read, or an end-of-
     file condition is encountered. The new-line character is discarded, and
     the string is terminated with a null character.

     Fgets reads characters from the stream into the array pointed to by s,
     until n-1 characters are read, or a new-line character is read and
     transferred to s, or an end-of-file condition is encountered. The string
     is then terminated with a null character.

SEE ALSO
     ferror(3S), fopen(3S), fread(3S), getc(3S), scanf(3S).

DIAGNOSTICS
     If end-of-file is encountered and no characters have been read, no
     characters are transferred to s, and a NULL pointer is returned. If a
     read error occurs, such as trying to use these functions on a file that
     has not been opened for reading, a NULL pointer is returned. Other-
     wise s is returned.

Figure 2-7: Manual Page for gets(3S)




As you can see from the illustration, two related functions are described 
on this page: gets and fgets. Each function gets a string from a stream in a 
slightly different way. The DESCRIPTION section tells how each operates. 

It is the SYNOPSIS section, however, that contains the critical information 
about how the function (or macro) is used in your program. Notice in 
Figure 2-7 that the first line in the SYNOPSIS is 

#include <stdio.h> 

This means that to use gets or fgets you must bring the standard I/O header 
file into your program (generally right at the top of the file). There is some- 
thing in stdio.h that is needed when you use the described functions. 
Figure 2-9 shows a version of stdio.h. Check it to see if you can understand
what gets or fgets uses. 

The next thing shown in the SYNOPSIS section of a manual page that 
documents system calls or subroutines is the formal declaration of the func- 
tion. The formal declaration tells you: 

■ the type of object returned by the function 

In our example, both gets and fgets return a character pointer.

■ the object or objects the function expects to receive when called 

These are the things enclosed in the parentheses of the function. gets
expects a character pointer. (The DESCRIPTION section describes 
what the tokens of the formal declaration stand for.) 

■ how the function is going to treat those objects 
The declaration 

char *s; 

in gets means that the token s enclosed in the parentheses will be con- 
sidered to be a pointer to a character string. Bear in mind that in the C 
language, when passed as an argument, the name of an array is con- 
verted to a pointer to the beginning of the array. 

We have chosen a simple example here in gets. If you want to test your-
self on something a little more complex, try working out the meaning of the
elements of the fgets declaration.

While we are on the subject of fgets, there is another piece of C esoterica
that we will explain. Notice that the third parameter in the fgets declaration
is referred to as stream. A stream, in this context, is a file with its associated
buffering. It is declared to be a pointer to a defined type FILE. Where is FILE
defined? Right! In stdio.h.

To finish off this discussion of the way you use functions described in the 
Programmer's Reference Manual in your own code, in Figure 2-8 we show a 
program fragment in which gets is used. 



#include <stdio.h>

main()
{
    char sarray[80];

    for(;;)
    {
        if (gets(sarray) != NULL)
        {
            /* Do something with the string */
        }
    }
}

Figure 2-8: How gets Is Used in a Program



You might ask, "Where is gets reading from?" The answer is, "From the
standard input." That generally means from something being keyed in from
the keyboard or output from another command that was piped to gets. How
do we know that? The DESCRIPTION section of the gets manual page says,
"gets reads characters from the standard input...." Where is the standard
input defined? In stdio.h.

#ifndef _NFILE
#define _NFILE    20

#define BUFSIZ    1024
#define _SBFSIZ   8

typedef struct {
        int             _cnt;
        unsigned char   *_ptr;
        unsigned char   *_base;
        char            _flag;
        char            _file;
} FILE;

#define _IOFBF    0000      /* _IOLBF means that a file's output   */
#define _IOREAD   0001      /* will be buffered line by line.      */
#define _IOWRT    0002      /* In addition to being flags, _IONBF, */
#define _IONBF    0004      /* _IOLBF and _IOFBF are possible      */
#define _IOMYBUF  0010      /* values for "type" in setvbuf.       */
#define _IOEOF    0020
#define _IOERR    0040
#define _IOLBF    0100
#define _IORW     0200

#ifndef NULL
#define NULL      0
#endif
#ifndef EOF
#define EOF       (-1)
#endif

Figure 2-9: A Version of stdio.h (Sheet 1 of 2)

#define stdin           (&_iob[0])
#define stdout          (&_iob[1])
#define stderr          (&_iob[2])

#define _bufend(p)      _bufendtab[(p)->_file]
#define _bufsiz(p)      (_bufend(p) - (p)->_base)

#ifndef lint
#define getc(p)         (--(p)->_cnt < 0 ? _filbuf(p) : (int) *(p)->_ptr++)
#define putc(x, p)      (--(p)->_cnt < 0 ? \
                        _flsbuf((unsigned char) (x), (p)) : \
                        (int) (*(p)->_ptr++ = (unsigned char) (x)))
#define getchar()       getc(stdin)
#define putchar(x)      putc((x), stdout)
#define clearerr(p)     ((void) ((p)->_flag &= ~(_IOERR | _IOEOF)))
#define feof(p)         ((p)->_flag & _IOEOF)
#define ferror(p)       ((p)->_flag & _IOERR)
#define fileno(p)       (p)->_file
#endif

extern FILE     _iob[_NFILE];
extern FILE     *fopen(), *fdopen(), *freopen(), *popen(), *tmpfile();
extern long     ftell();
extern void     rewind(), setbuf();
extern char     *ctermid(), *cuserid(), *fgets(), *gets(), *tempnam(), *tmpnam();
extern unsigned char *_bufendtab[];

#define L_ctermid       9
#define L_cuserid       9
#define P_tmpdir        "/usr/tmp/"
#define L_tmpnam        (sizeof(P_tmpdir) + 15)
#endif

Figure 2-9: A Version of stdio.h (Sheet 2 of 2)

Header Files and Libraries

In the earlier parts of this chapter there have been frequent references to
stdio.h, and a version of the file itself is shown in Figure 2-9. stdio.h is the
most commonly used header file in the UNIX System/C environment, but
there are many others.

Header files carry definitions and declarations that are used by more than 
one function. Header file names traditionally have the suffix .h, and are 
brought into a program at compile time by the C-preprocessor. The prepro- 
cessor does this because it interprets the #include statement in your program 
as a directive; as indeed it is. All keywords preceded by a pound sign (#) at 
the beginning of the line are treated as preprocessor directives. The two most 
commonly used directives are #include and #define. We have already seen 
that the #include directive is used to call in (and process) the contents of the 
named file. The #define directive is used to replace a name with a token- 
string. For example, 

#define _NFILE 20

sets to 20 the number of files a program can have open at one time. See
cpp(1) for the complete list of directives.

In the pages of the Programmer's Reference Manual there are about 45 dif- 
ferent .h files named. The format of the #include statement for all these 
shows the file name enclosed in angle brackets (<>), as in 

#include <stdio.h> 

The angle brackets tell the C preprocessor to look in the standard places 
for the file. In most systems the standard place is in the /usr/include direc- 
tory. If you have some definitions or external declarations that you want to 
make available in several files, you can create a .h file with any editor, store it
in a convenient directory and make it the subject of a #include statement 
such as the following: 

#include "../defs/rec.h"

It is necessary, in this case, to provide the relative path name of the file
and enclose it in quotation marks (" "). Fully-qualified path names (those that
begin with /) can create portability and organizational problems. An alterna-
tive to long or fully-qualified path names is to use the -Idir preprocessor
option when you compile the program. This option directs the preprocessor
to search for #include files whose names are enclosed in " " first in the
directory of the file being compiled, then in the directories named in the -I
option(s), and finally in directories on the standard list. In addition, all
#include files whose names are enclosed in angle brackets (< >) are first
searched for in the list of directories named in the -I option and finally in the
directories on the standard list.



Object File Libraries 

It is common practice in UNIX System computers to keep modules of
compiled code (object files) in archives, which by convention are designated
by a .a suffix. System calls from Section 2 and the subroutines in Section 3,
subsections 3C and 3S of the Programmer's Reference Manual that are functions
(as distinct from macros), are kept in an archive file by the name of libc.a. In
most systems, libc.a is found in the directory /lib. Many systems also have a
directory /usr/lib. Where both /lib and /usr/lib occur, /usr/lib is apt to be
used to hold archives that are related to specific applications.

During the link edit phase of the compilation and link edit process, copies 
of some of the object modules in an archive file are loaded with your execut- 
able code. By default, the cc command that invokes the C compilation system 
causes the link editor to search libc.a. If you need to point the link editor to
other libraries that are not searched by default, you do it by naming them
explicitly on the command line with the -l option. The format of the -l option
is -lx, where x is the library name, and can be up to nine characters. For
example, if your program includes functions from the curses screen control
package, the option

-lcurses

will cause the link editor to search for /lib/libcurses.a or
/usr/lib/libcurses.a and use the first one it finds to resolve references in your
program.

In cases where you want to direct the order in which archive libraries are
searched, you may use the -L dir option. Assuming the -L option appears on
the command line ahead of the -l option, it directs the link editor to search
the named directory for libx.a before looking in /lib and /usr/lib. This is
particularly useful if you are testing out a new version of a function that
already exists in an archive in a standard directory. Its success is due to the
fact that once having resolved a reference, the link editor stops looking. That
is why the -L option, if used, should appear on the command line ahead of
any -l specification.



Input/Output 

We talked some about I/O earlier in this chapter in connection with sys- 
tem calls and subroutines. A whole set of subroutines constitutes the C 
language standard I/O package, and there are several system calls that deal 
with the same area. In this section we want to get into the subject in a little 
more detail and describe for you how to deal with input and output concerns 
in your C programs. First off, let us briefly define what the subject of I/O 
encompasses. It has to do with 

■ creating and sometimes removing files 

■ opening and closing files used by your program 

■ transferring information from a file to your program (reading) 

■ transferring information from your program to a file (writing). 

In this section we will describe some of the subroutines you might choose 
for transferring information, but the heaviest emphasis will be on dealing with 
files. 

Three Files You Always Have 

Programs are permitted to have several files open simultaneously. The 
number may vary from system to system; the most common maximum is 20. 
_NFILE in stdio.h specifies the number of standard I/O FILEs a program is
permitted to have open. 

Any program automatically starts off with three files. If you will look
again at Figure 2-9, about midway through you will see that stdio.h contains
three #define directives that equate stdin, stdout, and stderr to the address
of _iob[0], _iob[1], and _iob[2], respectively. The array _iob holds informa-
tion dealing with the way standard I/O handles streams. It is a
representation of the open file table in the control block for your program.
The position in the array is a number that is also known as the file descriptor.
The default in UNIX Systems is to associate all three of these files with your
terminal.

The real significance is that functions and macros that deal with stdin or 
stdout can be used in your program with no further need to open or close 
files. For example, gets, cited above, reads a string from stdin; puts writes a 
null-terminated string to stdout. There are others that do the same thing (in 
slightly different ways: character at a time, formatted, etc.). You can specify 
that output be directed to stderr by using a function such as fprintf. fprintf 
works the same as printf except that it delivers its formatted output to a 
named stream, such as stderr. You can use the shell's redirection feature on 
the command line to read from or write into a named file. If you want to 
separate error messages from ordinary output being sent to stdout, and thence 
possibly piped by the shell to a succeeding program, you can do it by using 
one function to handle the ordinary output and a variation of the same func- 
tion that names the stream to handle error messages. 
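As a minimal sketch of that separation, the function below sends ordinary 
results to one stream and complaints to another. The name report and its 
arguments are made up for this illustration; only fprintf, stdout, and stderr 
come from the standard I/O package. The sketch uses ANSI-style prototypes 
rather than the older declaration style shown elsewhere in this chapter. 

```c
#include <stdio.h>

/*
 * report() is a hypothetical helper: non-negative values are written
 * to the stream `out`, and a complaint about each negative value is
 * written to the stream `err`.  It returns the number of complaints.
 */
int report(FILE *out, FILE *err, const int *values, int n)
{
    int i, bad = 0;

    for (i = 0; i < n; i++) {
        if (values[i] < 0) {
            (void)fprintf(err, "report: value %d is negative\n", values[i]);
            bad++;
        } else {
            (void)fprintf(out, "%d\n", values[i]);
        }
    }
    return bad;
}
```

A caller would typically pass stdout and stderr. The results can then be 
piped or redirected by the shell while the error messages still reach the 
terminal. 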

Named Files 

Any files other than stdin, stdout, and stderr that are to be used by your 
program must be explicitly connected by you before the file can be read from 
or written to. This can be done using the standard library routine fopen. 
fopen takes a path name (which is the name by which the file is known to the 
UNIX System file system), asks the system to keep track of the connection, 
and returns a pointer that you then use in functions that do the reads and 
writes. 

A structure is defined in stdio.h with a type of FILE. In your program 
you need to have a declaration such as 

FILE *fin; 

The declaration says that fin is a pointer to a FILE. You can then associate 
a particular file with the pointer with a statement in your program like 
the following: 

fin = fopen("filename", "r"); 

where filename is the path name to open. The "r" means that the file is to 
be opened for reading. This argument is known as the mode. As you might 
suspect, there are modes for reading, writing, and both reading and writing. 
In practice, the call to fopen is often included in an if statement that takes 
advantage of the fact that fopen returns a NULL pointer if it cannot open the 
file. An example is 

if ((fin = fopen("filename", "r")) == NULL) 
     (void)fprintf(stderr,"%s: Unable to open input file %s\n",argv[0],"filename"); 



Once the file has been successfully opened, the pointer fin is used in 
functions (or macros) to refer to the file. For example, 

int c; 

c = getc(fin); 

brings in a character at a time from the file into an integer variable called c. 
The variable c is declared as an integer even though we are reading characters 
because the function getc() returns an integer. Getting a character is often 
incorporated into some flow-of-control mechanism such as 

while ((c = getc(fin)) != EOF) 

that reads through the file until EOF is returned. EOF, NULL, and the macro 
getc are all defined in stdio.h. getc and the others that make up the standard 
I/O package keep advancing a pointer through the buffer associated with the 
file; the UNIX System and the standard I/O subroutines are responsible for 
seeing that the buffer is refilled (or written to the output file if you are 
producing output) when the pointer reaches the end of the buffer. All these 
mechanics are mercifully invisible to the program and the programmer. 

The function fclose is used to break the connection between the pointer in 
your program and the path name. The pointer may then be associated with 
another file by another call to fopen. This re-use of a file descriptor for a 
different stream may be necessary if your program has many files to open. For 
output files it is good to issue an fclose call, because the call makes sure that 
all output has been sent from the output buffer before disconnecting the file. 
The system call exit closes all open files for you. It also gets you completely 
out of your process, however, so it is safe to use only when you are sure you 
are completely finished. 
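The pieces described in this section (fopen with a mode, the test for NULL, 
the getc loop that runs until EOF, and fclose) can be put together in a 
short function. This is only a sketch: the name count_lines is invented for 
the example, and the prototype is written in ANSI style. 

```c
#include <stdio.h>

/*
 * count_lines() is a hypothetical helper.  It returns the number of
 * newline characters in the named file, or -1 if the file cannot
 * be opened for reading.
 */
long count_lines(const char *path)
{
    FILE *fin;
    int c;              /* int, not char, so that EOF can be recognized */
    long lines = 0;

    if ((fin = fopen(path, "r")) == NULL)
        return -1;
    while ((c = getc(fin)) != EOF)
        if (c == '\n')
            lines++;
    (void)fclose(fin);  /* break the connection between fin and the file */
    return lines;
}
```

Because the function closes the file before it returns, the caller can use it 
on any number of files, one after another, without running into the limit on 
open files. 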

Low-Level I/O and Why You Should Not Use It 

The term low-level I/O is used to refer to the process of using system 
calls from Section 2 of the Programmer's Reference Manual rather than the 
functions and subroutines of the standard I/O package. We are going to 
postpone until Chapter 3 any discussion of when this might be advantageous. If 
you find as you go through the information in this chapter that it is a good fit 
with the objectives you have as a programmer, it is a safe assumption that 
you can work with C language programs in the UNIX System for a good 
many years without ever having a real need to use system calls to handle 
your I/O and file accessing problems. Low-level I/O is perilous because it is 
more system-dependent: your programs become less portable and are 
probably no more efficient. 

System Calls for Environment or Status Information 

Under some circumstances you might want to be able to monitor or 
control the environment in your computer. There are system calls that can be 
used for this purpose. Some of them are shown in Figure 2-10. 

Function Name(s)               Purpose 

chdir                          change working directory 
chmod                          change access permission of a file 
chown                          change owner and group of a file 
getpid   getpgrp   getppid     get process IDs 
getuid   geteuid   getgid      get user IDs 
ioctl                          control device 
link     unlink                add or remove a directory entry 
mount    umount                mount or unmount a file system 
nice                           change priority of a process 
stat     fstat                 get file status 
time                           get time 
ulimit                         get and set user limits 
uname                          get name of current UNIX System 

Figure 2-10: Environment and Status System Calls 



As you can see, many of the functions shown in Figure 2-10 have 
equivalent UNIX System shell commands. Shell commands can easily be 
incorporated into shell scripts to accomplish the monitoring and control tasks 
you may need to do. The functions are available, however, and may be used 
in C programs as part of the UNIX System/C Language interface. They are 
documented in Section 2 of the Programmer's Reference Manual. 
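As a rough illustration of calling a few of these functions from C, the sketch 
below prints the system name, the process IDs, and the user IDs of the 
calling process. The function name show_environment is invented for the 
example, and the header names follow current POSIX practice, which may 
differ from the headers on your system. 

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/utsname.h>
#include <unistd.h>

/*
 * show_environment() is a hypothetical helper that prints a few facts
 * about the calling process on the stream `out`.
 * Returns 0 on success, -1 if uname() fails.
 */
int show_environment(FILE *out)
{
    struct utsname name;

    if (uname(&name) < 0)
        return -1;
    (void)fprintf(out, "system:  %s\n", name.sysname);
    (void)fprintf(out, "process: %ld (parent %ld)\n",
                  (long)getpid(), (long)getppid());
    (void)fprintf(out, "user:    real %ld, effective %ld\n",
                  (long)getuid(), (long)geteuid());
    return 0;
}
```

The IDs are cast to long before printing because their exact integer types 
vary from one system to another. 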



Processes 

Whenever you execute a command in the UNIX System you are initiating 
a process that is numbered and tracked by the operating system. A flexible 
feature of the UNIX System is that processes can be generated by other 
processes. This happens more than you might ever be aware of. For 
example, when you log in to your system, you are running a process, very 
probably the shell. If you then use an editor such as vi, take the option of 
invoking the shell from vi, and execute the ps command, you will see a display 
something like that in Figure 2-11 (which shows the results of a ps -f command): 



UID    PID     PPID    C    STIME      TTY     TIME   COMMAND 
abc    24210       1    0   06:13:14   tty29   0:05   -sh 
abc    24631   24210    0   06:59:07   tty29   0:13   vi c2.uli 
abc    28441   28358   80   09:17:22   tty29   0:01   ps -f 
abc    28358   24631    2   09:15:14   tty29   0:01   sh -i 

Figure 2-11: Process Status 



As you can see, user abc (who went through the steps described above) 
now has four processes active. It is an interesting exercise to trace the chain 
that is shown in the Process ID (PID) and Parent Process ID (PPID) columns. 
The shell that was started when user abc logged on is Process 24210; its 
parent is the initialization process (Process ID 1). Process 24210 is the parent 
of Process 24631, and so on. 

The four processes in the example above are all UNIX System shell level 
commands, but you can spawn new processes from your own program. 
(Actually, when you issue the command from your terminal to execute a pro- 
gram you are asking the shell to start another process, the process being your 
executable object module with all the functions and subroutines that were 
made a part of it by the link editor.) 

You might think, "Well, it's one thing to switch from one program to 
another when I'm at my terminal working interactively with the computer; but 
why would a program want to run other programs, and if one does, why 
wouldn't I just put everything together into one big executable module?" 

Overlooking the case where your program is itself an interactive applica- 
tion with diverse choices for the user, your program may need to run one or 
more other programs based on conditions it encounters in its own processing. 
(If it is the end of the month, go do a trial balance, for example.) The usual 
reasons why it might not be practical to create one large executable module 
follow: 

■ The load module may get too big to fit in the maximum process size 
for your system. 

■ You may not have control over the object code of all the other 
modules you want to include. 

Suffice it to say, there are legitimate reasons why this creation of new 
processes might need to be done. There are three ways to do it: 

■ system(3S) — requests the shell to execute a command 

■ exec(2) — stops this process and starts another 

■ fork(2) — starts an additional copy of this process 

system(3S) 

The formal declaration of the system function looks like the following: 

#include <stdio.h> 

int system(string) 
char *string; 

The function asks the shell to treat the string as a command line. The string 
can, therefore, be the name and arguments of any executable program or 
UNIX System shell command. If the exact arguments vary from one execution 
to the next, you may want to use sprintf to format the string before issuing 
the system command. When the command has finished running, system 
returns the shell exit status to your program. Execution of your program waits 
for the completion of the command initiated by system and then picks up 
again at the next executable statement. 
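The suggestion to format the string with sprintf before calling system might 
be sketched as follows. The function name list_directory and the choice of 
the ls -ld command are assumptions made for the illustration, and the sketch 
assumes the directory name contains no quote characters. 

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * list_directory() is a hypothetical helper.  It builds an "ls -ld"
 * command line for the given directory and hands it to the shell.
 * Returns whatever system() returns, or -1 if the name is too long
 * to fit in the command buffer.
 */
int list_directory(const char *dir)
{
    char command[256];

    /* leave room for the fixed text that surrounds the name */
    if (strlen(dir) > sizeof(command) - 16)
        return -1;
    (void)sprintf(command, "ls -ld '%s'", dir);
    return system(command);
}
```

The length check matters because sprintf does not know how large the 
destination array is. 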

exec(2) 

exec is the name of a family of functions that includes execl, execv, execle, 
execve, execlp, and execvp. They all have the function of transforming the 
calling process into a new process. The reason for the variety is to provide 
different ways of pulling together and presenting the arguments of the 
function. An example of one version (execl) might be: 

execl("/bin/prog2", "prog", progarg1, progarg2, (char *)0); 

For execl the argument list is 

/bin/prog2             path name of the new process file 

prog                   the name the new process gets in its argv[0] 

progarg1, progarg2     arguments to prog2 as char *'s 

(char *)0              a null char pointer to mark the end of the arguments 

Check the exec(2) manual page in the Programmer's Reference Manual for 
the rest of the details. The key point of the exec family is that there is no 
return from a successful execution: the calling process is finished, the new 
process overlays the old. The new process also takes over the Process ID and 
other attributes of the old process. If the call to exec is unsuccessful, control 
is returned to your program with a return value of -1. You can check errno 
(see below) to learn why it failed. 
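The "no return on success" rule means that any statement following an exec 
call runs only when the call has failed. A minimal sketch of that idea follows; 
the function name try_exec is hypothetical, and the header names follow 
current POSIX practice. 

```c
#include <errno.h>
#include <unistd.h>

/*
 * try_exec() is a hypothetical helper.  If execl() succeeds it never
 * returns, because the new program has replaced this one.  If it
 * fails, the function returns the errno value that explains why.
 */
int try_exec(const char *path, const char *arg)
{
    (void)execl(path, path, arg, (char *)0);
    /* Reaching this line means execl() returned -1. */
    return errno;
}
```

Called with a path name that does not exist, the function returns ENOENT; 
called with a valid program, it does not return at all. 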

fork(2) 

The fork system call creates a new process that is an exact copy of the 
calling process. The new process is known as the child process; the caller is 
known as the parent process. The one major difference between the two 
processes is that the child gets its own unique process ID. When the fork 
call has completed successfully, it returns a 0 to the child process and the 
child's process ID to the parent. If the idea of having two identical processes 
seems a little funny, consider this: 

■ Because the return value is different between the child process and the 
parent, the program can contain the logic to determine different paths. 

■ The child process could say, "Okay, I'm the child. I'm supposed to 
issue an exec for an entirely different program. " 

■ The parent process could say, "My child is going to be execing a new 
process. I'll issue a wait until I get word that that process is finished. " 

To take this out of the storybook world where programs talk like people and 
into the world of C programming (where people talk like programs), your 
code might include statements like this: 

#include <errno.h> 

int ch_stat, ch_pid, status; 
char *progarg1; 
char *progarg2; 
void exit(); 
extern int errno; 

     if ((ch_pid = fork()) < 0) 
     { 
          /* Could not fork... 
             check errno 
          */ 
     } 
     else if (ch_pid == 0)               /* child */ 
     { 
          (void)execl("/bin/prog2","prog",progarg1,progarg2,(char *)0); 
          exit(2);                       /* execl() failed */ 
     } 
     else                                /* parent */ 
     { 
          while ((status = wait(&ch_stat)) != ch_pid) 
          { 
               if (status < 0 && errno == ECHILD) 
                    break; 
               errno = 0; 
          } 
     } 

Figure 2-12: Example of fork 



Because the child process ID is taken over by the new exec'd process, the 
parent knows the ID. What this boils down to is a way of leaving one 
program to run another, returning to the point in the first program where 
processing left off. This is exactly what the system(3S) function does. As a 
matter of fact, system accomplishes it through this same procedure of forking 
and execing, with a wait in the parent. 
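To make the comparison with system concrete, here is a rough sketch of a 
function that runs a shell command by forking, execing the shell in the child, 
and waiting in the parent. It is not the library's actual implementation: the 
name run_command is invented, a real system also adjusts signal handling 
while it waits, and the sketch uses the POSIX waitpid call and status macros 
in place of the plain wait loop of Figure 2-12. 

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * run_command() is a hypothetical stand-in for system().  It returns
 * the command's exit status (0-255), or -1 if the fork or the wait
 * failed or the child was terminated by a signal.
 */
int run_command(const char *command)
{
    pid_t ch_pid;
    int ch_stat;

    if ((ch_pid = fork()) < 0)
        return -1;                      /* could not fork */
    if (ch_pid == 0) {                  /* child */
        (void)execl("/bin/sh", "sh", "-c", command, (char *)0);
        _exit(127);                     /* execl() failed */
    }
    /* parent: wait for this particular child, retrying if interrupted */
    while (waitpid(ch_pid, &ch_stat, 0) < 0)
        if (errno != EINTR)
            return -1;
    if (!WIFEXITED(ch_stat))
        return -1;
    return WEXITSTATUS(ch_stat);
}
```

For example, run_command("exit 3") returns 3, because the shell that the 
child becomes exits with that status. 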
Keep in mind that the fragment of code in Figure 2-12 includes a minimum 
amount of checking for error conditions. There is also potential confusion 
about open files and about which program is writing to a file. Leaving out the 
possibility of named files, the new process created by the fork or exec has the 
three standard files that are automatically opened: stdin, stdout, and stderr. 
If the parent has buffered output that should appear before output from the 
child, the buffers must be flushed before the fork. Also, if the parent and the 
child process both read input from a stream, whatever is read by one process 
will be lost to the other. That is, once something has been delivered from the 
input buffer to a process the pointer has moved on. 
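The advice about flushing buffers before a fork can be sketched as shown 
here. The function name announce_then_fork is hypothetical, and waitpid and 
the status macros are POSIX facilities used for brevity. Without the first 
fflush call, a heading still sitting in the parent's buffer would be copied into 
the child along with everything else and could be written twice. 

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * announce_then_fork() is a hypothetical helper.  The parent writes a
 * heading, flushes it, and only then forks a child that writes a line
 * of its own.  Returns 0 if the child exited with status 0, else -1.
 */
int announce_then_fork(void)
{
    pid_t ch_pid;
    int ch_stat;

    (void)printf("heading from parent\n");
    (void)fflush(stdout);       /* empty the buffer before it is copied */

    if ((ch_pid = fork()) < 0)
        return -1;
    if (ch_pid == 0) {          /* child */
        (void)printf("line from child\n");
        (void)fflush(stdout);
        _exit(0);               /* leave without flushing buffers again */
    }
    if (waitpid(ch_pid, &ch_stat, 0) != ch_pid)
        return -1;
    return (WIFEXITED(ch_stat) && WEXITSTATUS(ch_stat) == 0) ? 0 : -1;
}
```

The effect is easiest to see when the output is redirected to a file or a pipe, 
since output to a terminal is usually flushed at the end of each line anyway. 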

Pipes 

The idea of using pipes, a connection between the output of one program 
and the input of another, when working with commands executed by the shell 
is well established in the UNIX System environment. For example, to learn 
the number of archive files in your system you might enter a command like 

echo /lib/*.a /usr/lib/*.a | wc -w 

that first echoes all the files in /lib and /usr/lib that end in .a, then pipes the 
results to the wc command, which counts their number. 

A feature of the UNIX System/C Language interface is the ability to 
establish pipe connections between your process and a command to be exe- 
cuted by the shell, or between two cooperating processes. The first uses the 
popen(3S) subroutine that is part of the standard I/O package; the second 
requires the system call pipe(2). 

popen is similar in concept to the system subroutine in that it causes the 
shell to execute a command. The difference is that once having invoked 
popen from your program, you have established an open line to a con- 
currently running process through a stream. You can send characters or 
strings to this stream with standard I/O subroutines just as you would to 
stdout or to a named file. The connection remains open until your program 
invokes the companion pclose subroutine. 

A common application of this technique might be a pipe to a printer spooler. 
For example: 

#include <stdio.h> 

main() 
{ 
     FILE *pptr; 
     char *outstring; 

     if ((pptr = popen("lp","w")) != NULL) 
     { 
          for(;;) 
          { 
               /* Organize output; break out of the loop when done */ 
               (void)fprintf(pptr, "%s\n", outstring); 
          } 
          pclose(pptr); 
     } 
} 

Figure 2-13: Example of a popen pipe 
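The other kind of connection mentioned earlier, a pipe(2) between two 
cooperating processes, might be sketched as follows. The function name 
pipe_message is invented for the illustration. The child writes a message into 
the pipe and the parent reads it; each side closes the end of the pipe it does 
not use. 

```c
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * pipe_message() is a hypothetical helper.  The child writes `msg`
 * into a pipe and the parent reads it into `buf` (at most bufsize-1
 * bytes, then a terminating null).  Returns the number of bytes
 * read, or -1 on failure.
 */
int pipe_message(const char *msg, char *buf, int bufsize)
{
    int fd[2];          /* fd[0] is the read end, fd[1] the write end */
    pid_t ch_pid;
    int n, total = 0;

    if (pipe(fd) < 0)
        return -1;
    if ((ch_pid = fork()) < 0) {
        (void)close(fd[0]);
        (void)close(fd[1]);
        return -1;
    }
    if (ch_pid == 0) {                          /* child: the writer */
        (void)close(fd[0]);
        if (write(fd[1], msg, strlen(msg)) < 0)
            _exit(1);
        _exit(0);
    }
    (void)close(fd[1]);                         /* parent: the reader */
    while (total < bufsize - 1 &&
           (n = read(fd[0], buf + total, bufsize - 1 - total)) > 0)
        total += n;
    buf[total] = '\0';
    (void)close(fd[0]);
    (void)waitpid(ch_pid, NULL, 0);
    return total;
}
```

The parent's read returns 0 (end of file) only after every copy of the write 
end has been closed, which is why the parent closes fd[1] before it starts 
reading. 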
Error Handling 

Within your C programs you must determine the appropriate level of 
checking for valid data and for acceptable return codes from functions and 
subroutines. If you use any of the system calls described in Section 2 of the 
Programmer's Reference Manual, you have a way in which you can find out the 
probable cause of a bad return value. 

UNIX System calls that are not able to complete successfully almost 
always return a value of -1 to your program. (If you look through the system 
calls in Section 2, you will see that there are a few calls for which no return 
value is defined, but they are the exceptions.) In addition to the -1 that is 
returned to the program, the unsuccessful system call places an integer in an 
externally declared variable, errno. You can determine the value in errno if 
your program contains the statement 

#include <errno.h> 

The value in errno is not cleared on successful calls, so your program 
should check it only if the system call returned a -1. Errors are described in 
intro(2) of the Programmer's Reference Manual. 

The subroutine perror(3C) can be used to print an error message (on 
stderr) based on the value of errno. 
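A small sketch of this pattern follows: test the return value for -1, and only 
then look at errno. The function name open_or_explain is hypothetical; open, 
perror, and errno are the real facilities, and the header names follow current 
POSIX practice. 

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * open_or_explain() is a hypothetical helper.  It returns a file
 * descriptor from open(), or -1 after printing the reason on stderr.
 * The errno value is also stored through `why` (0 on success).
 */
int open_or_explain(const char *path, int *why)
{
    int fd;

    *why = 0;
    if ((fd = open(path, O_RDONLY)) == -1) {
        *why = errno;   /* save it first: later calls may change errno */
        perror(path);   /* prints the path, a colon, and the message */
    }
    return fd;
}
```

Saving errno right after the failing call is a common precaution, since any 
library routine called afterward is free to overwrite it. 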



Signals and Interrupts 

Signals and interrupts are two words for the same thing. Both words refer 
to messages passed by the UNIX System to running processes. Generally, the 
effect is to cause the process to stop running. Some signals are generated if 
the process attempts to do something illegal; others can be initiated by a user 
against his or her own processes, or by the super-user against any process. 

There is a system call, kill, that you can include in your program to send 
signals to other processes running under your user-id. The format for the kill 
call is 

kill(pid, sig) 

where pid is the process number against which the call is directed, and sig is 
an integer from 1 to 19 that shows the intent of the message. The name 
"kill" is something of an overstatement; not all the messages have a "drop 
dead" meaning. Some of the available signals are shown in Figure 2-14 as 
they are defined in <sys/signal.h>. 




#define SIGHUP   1    /* hangup */ 
#define SIGINT   2    /* interrupt (rubout) */ 
#define SIGQUIT  3    /* quit (ASCII FS) */ 
#define SIGILL   4    /* illegal instruction (not reset when caught)*/ 
#define SIGTRAP  5    /* trace trap (not reset when caught) */ 
#define SIGIOT   6    /* IOT instruction */ 
#define SIGABRT  6    /* used by abort, replace SIGIOT in the future */ 
#define SIGEMT   7    /* EMT instruction */ 
#define SIGFPE   8    /* floating point exception */ 
#define SIGKILL  9    /* kill (cannot be caught or ignored) */ 
#define SIGBUS   10   /* bus error */ 
#define SIGSEGV  11   /* segmentation violation */ 
#define SIGSYS   12   /* bad argument to system call */ 
#define SIGPIPE  13   /* write on a pipe with no one to read it */ 
#define SIGALRM  14   /* alarm clock */ 
#define SIGTERM  15   /* software termination signal from kill */ 
#define SIGUSR1  16   /* user defined signal 1 */ 
#define SIGUSR2  17   /* user defined signal 2 */ 
#define SIGCLD   18   /* death of a child */ 
#define SIGPWR   19   /* power-fail restart */ 

                      /* SIGWIND and SIGPHONE only used in UNIX/PC */ 
/*#define SIGWIND  20*/   /* window change */ 
/*#define SIGPHONE 21*/   /* handset, line status change */ 

#define SIGPOLL  22   /* pollable event occurred */ 

#define NSIG     23   /* The valid signal number is from 1 to NSIG-1 */ 
#define MAXSIG   32   /* size of u_signal[], NSIG-1 <= MAXSIG*/ 
                      /* MAXSIG is larger than we need now. */ 
                      /* In the future, we can add more signal */ 
                      /* number without changing user.h */ 

Figure 2-14: Signal Numbers Defined in /usr/include/sys/signal.h 

The signal(2) system call is designed to let you code methods of dealing 
with incoming signals. You have a three-way choice. You can (a) accept 
whatever the default action is for the signal, (b) have your program ignore the 
signal, or (c) write a function of your own to deal with it. 
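The three choices can be tried out in one short function. In this sketch the 
names on_usr1, got_usr1, and signal_demo are invented; signal, kill, SIG_IGN, 
and SIG_DFL are the real interface. The process sends SIGUSR1 to itself so 
that the example needs no second process. Note that the exact behavior of 
signal after a handler has run (whether the default action is restored) 
differs between systems, so long-lived programs should check the manual 
page for their own system. 

```c
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1 = 0;

/* Choice (c): a function of our own that simply records the signal. */
static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/*
 * signal_demo() is a hypothetical helper.  It ignores SIGUSR1, then
 * catches it, then restores the default action.
 * Returns 1 if the handler ran, 0 if it did not, -1 on error.
 */
int signal_demo(void)
{
    got_usr1 = 0;

    (void)signal(SIGUSR1, SIG_IGN);     /* choice (b): ignore the signal */
    (void)kill(getpid(), SIGUSR1);      /* has no effect while ignored */

    if (signal(SIGUSR1, on_usr1) == SIG_ERR)
        return -1;
    if (kill(getpid(), SIGUSR1) == -1)  /* this one reaches on_usr1() */
        return -1;

    (void)signal(SIGUSR1, SIG_DFL);     /* choice (a): default action */
    return got_usr1 ? 1 : 0;
}
```

The handler does nothing but set a flag. Keeping handlers that small is a 
common design choice, because only a limited set of library functions is safe 
to call while a signal is being handled. 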
The UNIX System provides several commands designed to help you 
discover the causes of problems in programs and to learn about potential 
problems. 



Sample Program 

To illustrate how these commands are used and the type of output they 
produce, we have constructed a sample program that opens and reads an 
input file and performs one to three subroutines, according to options speci- 
fied on the command line. This program does not do anything you could not 
do quite easily on your pocket calculator, but it does serve to illustrate some 
points. The source code is shown in Figure 2-15. The header file, recdef.h, is 
shown at the end of the source code. 

The output produced by the various analysis and debugging tools 
illustrated in this section may vary slightly from one installation to another. The 
Programmer's Reference Manual is a good source of additional information 
about the contents of the reports. 

#include <stdio.h> 
#include "recdef.h" 

#define TRUE 1 
#define FALSE 0 

main(argc, argv) 
int argc; 
char *argv[]; 
{ 
     FILE *fopen(), *fin; 
     void exit(); 
     int getopt(); 
     int oflag = FALSE; 
     int pflag = FALSE; 
     int rflag = FALSE; 
     int ch; 
     struct rec first; 
     extern int opterr; 
     extern float oppty(), pft(), rfe(); 

/* restate.c is continued on the next page */ 

Figure 2-15: Source Code for Sample Program (Sheet 1 of 4) 

     if (argc < 2) 
     { 
          (void) fprintf(stderr, "%s: Must specify option\n",argv[0]); 
          (void) fprintf(stderr, "Usage: %s -rpo\n", argv[0]); 
          exit(2); 
     } 

     opterr = FALSE; 

     while ((ch = getopt(argc,argv,"opr")) != EOF) 
     { 
          switch(ch) 
          { 
          case 'o': 
               oflag = TRUE; 
               break; 
          case 'p': 
               pflag = TRUE; 
               break; 
          case 'r': 
               rflag = TRUE; 
               break; 
          default: 
               (void) fprintf(stderr, "Usage: %s -rpo\n",argv[0]); 
               exit(2); 
          } 
     } 

     if ((fin = fopen("info","r")) == NULL) 
     { 
          (void) fprintf(stderr, "%s: cannot open input file %s\n",argv[0],"info"); 
          exit(2); 
     } 

Figure 2-15: Source Code for Sample Program (Sheet 2 of 4) 

/* restate.c continued */ 

     if (fscanf(fin, "%s%f%f%f%f%f%f",first.pname,&first.ppx, 
          &first.dp,&first.i,&first.c,&first.t,&first.spx) != 7) 
     { 
          (void) fprintf(stderr,"%s: cannot read first record from %s\n", 
               argv[0],"info"); 
          exit(2); 
     } 

     printf("Property: %s\n",first.pname); 
     if(oflag) 
          printf("  Opportunity Cost: $%#5.2f\n",oppty(&first)); 
     if(pflag) 
          printf("  Anticipated Profit(loss): $%#7.2f\n",pft(&first)); 
     if(rflag) 
          printf("  Return on Funds Employed: %#3.2f%%\n",rfe(&first)); 
} 

/* End of Main Module -- restate.c */ 

/* Opportunity Cost -- oppty.c */ 

#include "recdef.h" 

float 
oppty(ps) 
struct rec *ps; 
{ 
     return(ps->i/12 * ps->t * ps->dp); 
} 

Figure 2-15: Source Code for Sample Program (Sheet 3 of 4) 

/* Profit -- pft.c */ 

#include "recdef.h" 

float 
pft(ps) 
struct rec *ps; 
{ 
     return(ps->spx - ps->ppx + ps->c); 
} 

/* Return on Funds Employed -- rfe.c */ 

#include "recdef.h" 

float 
rfe(ps) 
struct rec *ps; 
{ 
     return(100 * (ps->spx - ps->c) / ps->spx); 
} 

/* Header File -- recdef.h */ 

struct rec {             /* To hold input */ 
     char pname[25]; 
     float ppx; 
     float dp; 
     float i; 
     float c; 
     float t; 
     float spx; 
} ; 

Figure 2-15: Source Code for Sample Program (Sheet 4 of 4) 



cflow 

cflow produces a chart of the external references in C, yacc, lex, and 
assembly language files. Using the modules of our sample program, the com- 
mand 

cflow restate.c oppty.c pft.c rfe.c 

produces the output shown in Figure 2-16. 



1     main: int(), <restate.c 11> 
2          fprintf: <> 
3          exit: <> 
4          getopt: <> 
5          fopen: <> 
6          fscanf: <> 
7          printf: <> 
8          oppty: float(), <oppty.c 7> 
9          pft: float(), <pft.c 7> 
10         rfe: float(), <rfe.c 8> 

Figure 2-16: cflow Output, No Options 

The -r option looks at the caller:callee relationship from the other side. It 
produces the output shown in Figure 2-17. 




1     exit: <> 
2          main : <> 
3     fopen: <> 
4          main : 2 
5     fprintf: <> 
6          main : 2 
7     fscanf: <> 
8          main : 2 
9     getopt: <> 
10         main : 2 
11    main: int(), <restate.c 11> 
12    oppty: float(), <oppty.c 7> 
13         main : 2 
14    pft: float(), <pft.c 7> 
15         main : 2 
16    printf: <> 
17         main : 2 
18    rfe: float(), <rfe.c 8> 
19         main : 2 

Figure 2-17: cflow Output, Using -r Option 

The -ix option causes external and static data symbols to be included. 
Our sample program has only one such symbol, opterr. The output is shown 
in Figure 2-18. 



1     main: int(), <restate.c 11> 
2          fprintf: <> 
3          exit: <> 
4          opterr: <> 
5          getopt: <> 
6          fopen: <> 
7          fscanf: <> 
8          printf: <> 
9          oppty: float(), <oppty.c 7> 
10         pft: float(), <pft.c 7> 
11         rfe: float(), <rfe.c 8> 

Figure 2-18: cflow Output, Using -ix Option 

Combining the -r and the -ix options produces the output shown in 
Figure 2-19. 



1     exit: <> 
2          main : <> 
3     fopen: <> 
4          main : 2 
5     fprintf: <> 
6          main : 2 
7     fscanf: <> 
8          main : 2 
9     getopt: <> 
10         main : 2 
11    main: int(), <restate.c 11> 
12    oppty: float(), <oppty.c 7> 
13         main : 2 
14    opterr: <> 
15         main : 2 
16    pft: float(), <pft.c 7> 
17         main : 2 
18    printf: <> 
19         main : 2 
20    rfe: float(), <rfe.c 8> 
21         main : 2 

Figure 2-19: cflow Output, Using -r and -ix Options 



ctrace 

ctrace lets you follow the execution of a C program statement by 
statement. ctrace takes a .c file as input and inserts statements in the source 
code to print out variables as each program statement is executed. You must 
direct the output of this process to a temporary .c file. The temporary file is 
then used as input to cc. When the resulting a.out file is executed, it produces 
output that can tell you a lot about what is going on in your program. 

Options give you the ability to limit the number of times through loops. 
You can also include functions in your source file that turn the trace off and 
on so that you can limit the output to portions of the program that are of 
particular interest. 

ctrace accepts only one source code file as input. To use our sample 
program to illustrate, it is necessary to execute the following four commands: 

ctrace restate.c > ct.main.c 
ctrace oppty.c > ct.op.c 
ctrace pft.c > ct.p.c 
ctrace rfe.c > ct.r.c 

The names of the output files are completely arbitrary. Use any names 
that are convenient for you. The names must end in .c, since the files are 
used as input to the C compilation system: 

cc -o ct.run ct.main.c ct.op.c ct.p.c ct.r.c 

Now the command 

ct.run -opr 

produces the output shown in Figure 2-20. The command above will cause 
the output to be directed to your terminal (stdout). It is probably a good idea 
to direct it to a file or to a printer so that you can refer to it. 

 8 main(argc, argv) 
23      if (argc < 2) 
        /* argc == 2 */ 
30      opterr = FALSE; 
        /* FALSE == 0 */ 
        /* opterr == 0 */ 
31      while ((ch = getopt(argc,argv,"opr")) != EOF) 
        /* argc == 2 */ 
        /* argv == 15729316 */ 
        /* ch == 111 or 'o' or "t" */ 
32      { 
33           switch(ch) 
             /* ch == 111 or 'o' or "t" */ 
35           case 'o': 
36                oflag = TRUE; 
                  /* TRUE == 1 or "h" */ 
                  /* oflag == 1 or "h" */ 
37                break; 
48      } 
31      while ((ch = getopt(argc,argv,"opr")) != EOF) 
        /* argc == 2 */ 
        /* argv == 15729316 */ 
        /* ch == 112 or 'p' */ 
32      { 
33           switch(ch) 
             /* ch == 112 or 'p' */ 
38           case 'p': 
39                pflag = TRUE; 
                  /* TRUE == 1 or "h" */ 
                  /* pflag == 1 or "h" */ 
40                break; 
48      } 

Figure 2-20: ctrace Output (Sheet 1 of 3) 

31      while ((ch = getopt(argc,argv,"opr")) != EOF) 
        /* argc == 2 */ 
        /* argv == 15729316 */ 
        /* ch == 114 or 'r' */ 
32      { 
33           switch(ch) 
             /* ch == 114 or 'r' */ 
41           case 'r': 
42                rflag = TRUE; 
                  /* TRUE == 1 or "h" */ 
                  /* rflag == 1 or "h" */ 
43                break; 
48      } 
31      while ((ch = getopt(argc,argv,"opr")) != EOF) 
        /* argc == 2 */ 
        /* argv == 15729316 */ 
        /* ch == -1 */ 
49      if ((fin = fopen("info","r")) == NULL) 
        /* fin == 140200 */ 
54      if (fscanf(fin, "%s%f%f%f%f%f%f",first.pname,&first.ppx, 
             &first.dp,&first.i,&first.c,&first.t,&first.spx) != 7) 
        /* fin == 140200 */ 
        /* first.pname == 15729528 */ 
61      printf("Property: %s\n",first.pname); 
        /* first.pname == 15729528 or "Linden_Place" */ Property: Linden_Place 

63      if(oflag) 
        /* oflag == 1 or "h" */ 
64           printf("  Opportunity Cost: $%#5.2f\n",oppty(&first)); 
 5 oppty(ps) 
 8      return(ps->i/12 * ps->t * ps->dp); 
        /* ps->i == 1069044203 */ 
        /* ps->t == 1076494336 */ 
        /* ps->dp == 1088765312 */   Opportunity Cost: $4476.87 

Figure 2-20: ctrace Output (Sheet 2 of 3) 

66      if(pflag) 
        /* pflag == 1 or "h" */ 
67           printf("  Anticipated Profit(loss): $%#7.2f\n",pft(&first)); 
 5 pft(ps) 
 8      return(ps->spx - ps->ppx + ps->c); 
        /* ps->spx == 1091649040 */ 
        /* ps->ppx == 1091178464 */ 
        /* ps->c == 1087409536 */   Anticipated Profit(loss): $85950.00 

69      if(rflag) 
        /* rflag == 1 or "h" */ 
70           printf("  Return on Funds Employed: %#3.2f%%\n",rfe(&first)); 
 6 rfe(ps) 
 9      return(100 * (ps->spx - ps->c) / ps->spx); 
        /* ps->spx == 1091649040 */ 
        /* ps->c == 1087409536 */   Return on Funds Employed: 94.00% 
        /* return */ 

Figure 2-20: ctrace Output (Sheet 3 of 3) 



Using a program that runs successfully is not the optimal way to demon- 
strate ctrace. It would be more helpful to have an error in the operation that 
could be detected by ctrace. This utility might be most useful in cases where 
the program runs to completion, but the output is not as expected. 



cxref

cxref analyzes a group of C source code files and builds a cross-reference
table of the automatic, static, and global symbols in each file. The command

cxref -c -o cx.op restate.c oppty.c pft.c rfe.c

produces the output shown in Figure 2-21 in a file named, in this case, cx.op.
The -c option causes the reports for the four .c files to be combined in one
cross-reference file.

restate.c:
oppty.c:
pft.c:
rfe.c:

SYMBOL          FILE                    FUNCTION    LINE

BUFSIZ          /usr/include/stdio.h                *9
EOF             /usr/include/stdio.h                49 *50
                restate.c                           31
FALSE           restate.c               --          *6 15
FILE            /usr/include/stdio.h                *29 73
                restate.c               main        12
L_ctermid       /usr/include/stdio.h                *80
L_cuserid       /usr/include/stdio.h                *81
L_tmpnam        /usr/include/stdio.h                *83
NULL            /usr/include/stdio.h                46 *47
                restate.c                           49
P_tmpdir        /usr/include/stdio.h                *82
TRUE            restate.c                           *5 36
_IOEOF          /usr/include/stdio.h                *41
_IOERR          /usr/include/stdio.h                *42
_IOFBF          /usr/include/stdio.h                *36
_IOLBF          /usr/include/stdio.h                *43
_IOMYBUF        /usr/include/stdio.h                *40
_IONBF          /usr/include/stdio.h                *39
_IOREAD         /usr/include/stdio.h                *37
_IORW           /usr/include/stdio.h                *44
_IOWRT          /usr/include/stdio.h                *38
_NFILE          /usr/include/stdio.h                2 *3 73
_SBFSIZ         /usr/include/stdio.h                *16

Figure 2-21: cxref Output, Using -c Option (Sheet 1 of 5)
SYMBOL          FILE                    FUNCTION    LINE

_base           /usr/include/stdio.h                *26
_bufend()       /usr/include/stdio.h                *57
_bufendtab      /usr/include/stdio.h                *78
_bufsiz()       /usr/include/stdio.h                *58
_cnt            /usr/include/stdio.h                *20
_file           /usr/include/stdio.h                *28
_flag           /usr/include/stdio.h                *27
_iob            /usr/include/stdio.h                *73
                restate.c               main        25 26 45 51 57
_ptr            /usr/include/stdio.h                *21
argc            restate.c                           8
                restate.c               main        *9 23 31
argv            restate.c                           8
                restate.c               main        *10 25 26 31 45 51 57
c               ./recdef.h                          *6
                pft.c                   pft         8
                restate.c               main        55
                rfe.c                   rfe         9
ch              restate.c               main        *18 31 33
clearerr()      /usr/include/stdio.h                *67
ctermid()       /usr/include/stdio.h                *77
cuserid()       /usr/include/stdio.h                *77
dp              ./recdef.h              --          *4
                oppty.c                 oppty       8
                restate.c               main        55
exit()          restate.c               main        *13 27 46 52 58
fdopen()        /usr/include/stdio.h                *74

Figure 2-21: cxref Output, Using -c Option (Sheet 2 of 5)
SYMBOL          FILE                    FUNCTION    LINE

feof()          /usr/include/stdio.h                *68
ferror()        /usr/include/stdio.h                *69
fgets()         /usr/include/stdio.h                *77
fileno()        /usr/include/stdio.h                *70
fin             restate.c               main        *12 49 54
first           restate.c               main        *19 54 55 61 64 67 70
fopen()         /usr/include/stdio.h                *74
                restate.c               main        12 49
fprintf         restate.c               main        25 26 45 51 57
freopen()       /usr/include/stdio.h                *74
fscanf          restate.c                           54
ftell()         /usr/include/stdio.h                *75
getc()          /usr/include/stdio.h                *61
getchar()       /usr/include/stdio.h                *65
getopt()        restate.c                           *14 31
gets()          /usr/include/stdio.h                *77
i               ./recdef.h                          *5
                oppty.c                 oppty       8
                restate.c               main        55
lint            /usr/include/stdio.h                60
main()          restate.c                           *8

Figure 2-21: cxref Output, Using -c Option (Sheet 3 of 5)
SYMBOL          FILE                    FUNCTION    LINE

oflag           restate.c               main        [illegible]
oppty()         oppty.c                             *5
                restate.c               main        *21 64
opterr          restate.c               main
p               /usr/include/stdio.h                *57 *58 *62 63 64 67 *67 68 *68 69 *69 70 *70
pdp11           /usr/include/stdio.h                11
pflag           restate.c
pft()           pft.c                               *5
                restate.c               main        *21 67
pname           ./recdef.h                          *2
                restate.c               main        54 61
popen()         /usr/include/stdio.h
ppx             ./recdef.h                          *3
                pft.c                   pft         [illegible]
                restate.c               main        54
printf          restate.c               main        [illegible]
ps              oppty.c                 oppty       *6 8
                pft.c                               5
                pft.c                   pft         *6 8
                rfe.c                               6
                rfe.c                   rfe         *7 9
putc()          /usr/include/stdio.h                *62
putchar()       /usr/include/stdio.h                *66
rec             ./recdef.h                          *1
                oppty.c                 oppty       6
                pft.c                   pft         6
                restate.c               main        19
                rfe.c                   rfe         7

Figure 2-21: cxref Output, Using -c Option (Sheet 4 of 5)
SYMBOL          FILE                    FUNCTION    LINE

rewind()        /usr/include/stdio.h                *76
rfe()           restate.c               main        *21 70
                rfe.c                               *6
rflag           restate.c               main        *17 42 69
setbuf()        /usr/include/stdio.h                *76
spx             ./recdef.h                          *8
                pft.c                   pft         8
                restate.c               main        55
                rfe.c                   rfe         9
stderr          /usr/include/stdio.h                *55
                restate.c                           25 26 45 51 57
stdin           /usr/include/stdio.h                *53
stdout          /usr/include/stdio.h                *54
t               ./recdef.h                          *7
                oppty.c                 oppty       8
                restate.c               main        55
tempnam()       /usr/include/stdio.h                *77
tmpfile()       /usr/include/stdio.h                *74
tmpnam()        /usr/include/stdio.h                *77
u370            /usr/include/stdio.h                5
u3b             /usr/include/stdio.h                8 19
u3b5            /usr/include/stdio.h                8 19
vax             /usr/include/stdio.h                8 19
x               /usr/include/stdio.h                *62 63 64 66 *66

Figure 2-21: cxref Output, Using -c Option (Sheet 5 of 5)
lint

lint looks for features in a C program that are apt to cause execution 
errors, that are wasteful of resources, or that create problems of portability. 
The following command produces the output shown in Figure 2-22: 

lint restate.c oppty.c pft.c rfe.c 



restate.c:
restate.c
(71) warning: main() returns random value to invocation environment
oppty.c:
pft.c:
rfe.c:

function returns value which is always ignored
    printf



Figure 2-22: lint Output 



lint has options that will produce additional information. Check the
Programmer's Reference Manual. The error messages give you the line numbers
of some items you may want to review.
prof

prof produces a report on the amount of execution time spent in various 
portions of your program and the number of times each function is called. 
The program must be compiled with the -p option. When a program that was 
compiled with that option is run, a file called mon.out is produced. mon.out 
and a.out (or whatever name identifies your executable file) are input to the 
prof command. 

The sequence of steps needed to produce a profile report for our sample 
program is as follows: 

Step 1: Compile the programs with the -p option: 
cc -p restate.c oppty.c pft.c rfe.c 

Step 2: Run the program to produce a file mon.out. 
a.out -opr 

Step 3: Execute the prof command: 
prof a.out 



The example of the output of this last step is shown in Figure 2-23. The 
figures may vary from one run to another. You will also notice that programs 
of very small size, like that used in the example, produce statistics that are not 
overly helpful. 
Seconds   Cumsecs   #Calls

Figure 2-23: prof Output
size

size produces information on the number of bytes occupied by the three 
sections (text, data, and bss) of a common object file when the program is 
brought into main memory to be run. Here are the results of one invocation 
of the size command with our object file as an argument. 

11832 + 3872 + 2240 = 17944 

Do not confuse this number with the number of characters in the object
file that appears when you do an ls -l command. That figure includes the
symbol table and other header information that is not used at run time.



strip 

strip removes the symbol and line number information from a common 
object file. When you issue this command, the number of characters shown 
by the ls -l command approaches the figure shown by the size command, but
still includes some header information that is not counted as part of the .text, 
.data, or .bss section. After the strip command has been executed, it is no 
longer possible to use the file with the sdb command. 



sdb 

sdb stands for Symbolic Debugger, which means you can use the sym- 
bolic names in your program to pinpoint where a problem has occurred. You 
can use sdb to debug C programs. There are two basic ways to use sdb: by 
running your program under control of sdb, or by using sdb to rummage 
through a core image file left by a program that failed. The first way lets you 
see what the program is doing up to the point at which it fails (or to skip 
around the failure point and proceed with the run). The second method lets 
you check the status at the moment of failure, which may or may not disclose 
the reason the program failed. 
Chapter 15 contains a tutorial on sdb that describes the interactive commands
you can use to work your way through your program. For the time
being we want to tell you just a couple of key things you need to do when 
using it. 

1. Compile your program(s) with the -g option, which causes additional
information to be generated for use by sdb. 

2. Run your program under sdb with the following command: 

sdb myprog - srcdir 

where myprog is the name of your executable file (a.out is the 
default), and srcdir is an optional list of the directories where source 
code for your modules may be found. The dash between the two 
arguments keeps sdb from looking for a core image file. 
The following three utilities are helpful in keeping your programming
work organized effectively. 

The make Command 

When you have a program that is made up of more than one module of
code, you begin to run into problems of keeping track of which modules are
up-to-date and which need to be recompiled when changes are made in
another module. The make command is used to ensure that dependencies
between modules are recorded so that changes in one module result in the
recompilation of dependent programs. Even control of a program as simple
as the one shown in Figure 2-15 is made easier through the use of make.

The make utility requires a description file that you create with an editor. 
The description file (also referred to by its default name: makefile) contains 
the information used by make to keep a target file current. The target file is 
typically an executable program. A description file contains three types of 
information: 

dependency information
    tells the make utility the relationship between the modules that
    comprise the target program.

executable commands
    are needed to generate the target program. make uses the dependency
    information to determine which executable commands should be passed
    to the shell for execution.

macro definitions
    provide a shorthand notation within the description file to make
    maintenance easier. Macro definitions can be overridden by
    information from the command line when the make command is entered.

The make command works by checking the "last changed" time of the
modules named in the description file. When make finds a component that
has been changed more recently than modules that depend on it, the specified
commands (usually compilations) are passed to the shell for execution.
The make command takes three kinds of arguments: options, macro
definitions, and target file names. If no description file name is given as an 
option on the command line, make searches the current directory for a file 
named makefile or Makefile. Figure 2-24 shows a makefile for our sample 
program. 

OBJECTS = restate.o oppty.o pft.o rfe.o
all: restate
restate: $(OBJECTS)
	$(CC) $(CFLAGS) $(LDFLAGS) $(OBJECTS) -o restate

$(OBJECTS): ./recdef.h

clean:
	rm -f $(OBJECTS)

clobber: clean
	rm -f restate

Figure 2-24: make Description File 



The following things are worth noticing in this description file: 

■ It identifies the target, restate, as being dependent on the four object 
modules. Each of the object modules in turn is defined as being depen- 
dent on the header file, recdef.h, and by default, on its corresponding 
source file. 

■ A macro, OBJECTS, is defined as a convenient shorthand for referring 
to all of the component modules. 

Whenever testing or debugging results in a change to one of the com- 
ponents of restate, for example, a command such as the following should be 
entered: 

make CFLAGS=-g restate 
This has been a very brief overview of the make utility. There is more on
make in Chapter 3, and a detailed description of make can be found in 
Chapter 13. 



The Archive 

The most common use of an archive file, although not the only one, is to 
hold object modules that make up a library. The library can be named on the 
link editor command line (or with a link editor option on the cc command 
line). This causes the link editor to search the symbol table of the archive file 
when attempting to resolve references. 

The ar command is used to create an archive file, to manipulate its con- 
tents, and to maintain its symbol table. The structure of the ar command is a 
little different from the normal UNIX System arrangement of command line 
options. When you enter the ar command you include a one-character key 
from the set drqtpmx that defines the type of action you intend. The key 
may be combined with one or more additional characters from the set 
vuaibcls that modify the way the requested operation is performed. The 
makeup of the command line is 

ar -key [posname] afile [name]...

where posname is the name of a member of the archive and may be used with 
some optional key characters to make sure that the files in your archive are in 
a particular order. The afile argument is the name of your archive file. By
convention, the suffix .a is used to indicate that the named file is an archive
file. (libc.a, for example, is the archive file that contains many of the object
files of the standard C subroutines.) One or more names may be furnished.
These identify files that are subjected to the action specified in the key.

We can make an archive file to contain the modules used in our sample 
program, restate. The command to do this is 

ar -rv rste.a restate.o oppty.o pft.o rfe.o

If these are the only .o files in the current directory, you can use shell 
metacharacters as follows: 

ar -rv rste.a *.o 
Either command will produce this feedback:

a - restate.o
a - oppty.o
a - pft.o
a - rfe.o
ar: creating rste.a

The nm command is used to get a variety of information from the symbol 
table of common object files. The object files can be, but do not have to be, 
in an archive file. Figure 2-25 shows the output of this command when exe- 
cuted with the -f (for full) option on the archive we just created. The object 
files were compiled with the -g option. 
Symbols from rste.a[restate.o]

Name        Value   Class    Type              Size   Line   Section

.0fake              strtag   struct            16
restate.c           file
_cnt                strmem   int
_ptr        4       strmem   *Uchar
_base       8       strmem   *Uchar
_flag       12      strmem   char
_file       13      strmem   char
.eos                endstr                     16
rec                 strtag   struct            52
pname               strmem   char[25]          25
ppx         28      strmem   float
dp          32      strmem   float
i           36      strmem   float
c           40      strmem   float
t           44      strmem   float
spx         48      strmem   float
.eos                endstr                     52
main                extern   int()             520           .text
.bf         10      fcn                               11     .text
argc                argm't   int
argv        4       argm't   **char
fin                 auto     *struct-.0fake    16
oflag       4       auto     int
pflag       8       auto     int
rflag       12      auto     int
ch          16      auto     int

Figure 2-25: nm Output, with -f Option (Sheet 1 of 5)
Symbols from rste.a[restate.o]

Name        Value   Class    Type              Size   Line   Section

first       20      auto     struct-rec        52
.ef         518     fcn                               61     .text
FILE                typdef   struct-.0fake     16
.text               static                     31     39     .text
.data       520     static                            4      .data
.bss        824     static                                   .bss
_iob                extern
fprintf             extern
exit                extern
opterr              extern
getopt              extern
fopen               extern
fscanf              extern
printf              extern
oppty               extern
pft                 extern
rfe                 extern

Figure 2-25: nm Output, with -f Option (Sheet 2 of 5)
Symbols from rste.a[oppty.o]

Name        Value   Class    Type              Size   Line   Section

oppty.c             file
rec                 strtag   struct            52
pname               strmem   char[25]          25
ppx         28      strmem   float
dp          32      strmem   float
i           36      strmem   float
c           40      strmem   float
t           44      strmem   float
spx         48      strmem   float
.eos                endstr                     52
oppty               extern   float()           64            .text
.bf         10      fcn                               7      .text
ps                  argm't   *struct-rec       52
.ef         62      fcn                               3      .text
.text               static                     4      1      .text
.data       64      static                                   .data
.bss        72      static                                   .bss

Figure 2-25: nm Output, with -f Option (Sheet 3 of 5)
Symbols from rste.a[pft.o]

Name        Value   Class    Type              Size   Line   Section

pft.c               file
rec                 strtag   struct            52
pname               strmem   char[25]          25
ppx         28      strmem   float
dp          32      strmem   float
i           36      strmem   float
c           40      strmem   float
t           44      strmem   float
spx         48      strmem   float
.eos                endstr                     52
pft                 extern   float()           60            .text
.bf         10      fcn                               7      .text
ps                  argm't   *struct-rec       52
.ef         58      fcn                               3      .text
.text               static                     4             .text
.data       60      static                                   .data
.bss        60      static                                   .bss

Figure 2-25: nm Output, with -f Option (Sheet 4 of 5)
Symbols from rste.a[rfe.o]

Name        Value   Class    Type              Size   Line   Section

rfe.c               file
rec                 strtag   struct            52
pname               strmem   char[25]          25
ppx         28      strmem   float
dp          32      strmem   float
i           36      strmem   float
c           40      strmem   float
t           44      strmem   float
spx         48      strmem   float
.eos                endstr                     52
rfe                 extern   float()           68            .text
.bf         10      fcn                               8      .text
ps                  argm't   *struct-rec       52
.ef         64      fcn                               3      .text
.text               static                     4      1      .text
.data       68      static                                   .data
.bss        76      static                                   .bss

Figure 2-25: nm Output, with -f Option (Sheet 5 of 5)



For nm to work on an archive file, all of the contents of the archive have
to be object modules. If you have stored other things in the archive, you will
get the message

nm: rste.a bad magic

when you try to execute the command. 



Use of SCCS by Single-User Programmers

The UNIX System Source Code Control System (SCCS) is a set of pro- 
grams designed to keep track of different versions of programs. When a pro- 
gram has been placed under control of SCCS, only a single copy of any one 
version of the code can be retrieved for editing at a given time. When
program code is changed and the program returned to SCCS, only the
changes are recorded. Each version of the code is identified by its SID, or 
SCCS IDentifying number. By specifying the SID when the code is extracted 
from the SCCS file, it is possible to return to an earlier version. If an early 
version is extracted with the intent of editing it and returning it to SCCS, a 
new branch of the development tree is started. The set of programs that make 
up SCCS appear as UNIX System commands. The commands are as follows: 

admin 

get 

delta 

prs 

rmdel 

cdc 

what 

sccsdiff 

comb 

val 

It is most common to think of SCCS as a tool for project control of large 
programming projects. It is, however, entirely possible for any individual user 
of the UNIX System to set up a private SCCS system. Chapter 14 is an SCCS 
user's guide. 
Introduction



This chapter deals with programming where the objective is to produce 
sets of programs (applications) that will run on a UNIX System computer. 

The chapter begins with a discussion of how the ground rules change as 
you move up the scale from writing programs that are essentially for your 
own private use (we have called this single-user programming), to working as 
a member of a programming team developing an application that is to be 
turned over to others to use. 

There is a section on how the criteria for selecting appropriate program- 
ming languages may be influenced by the requirements of the application. 

The next three sections of the chapter deal with a number of loosely- 
related topics that are of importance to programmers working in the applica- 
tion development environment. Most of these mirror topics that were dis- 
cussed in Chapter 2, Programming Basics, but here we try to point out aspects 
of the subject that are particularly pertinent to application programming. 
They are covered under the following headings: 

■ Advanced Programming Tools 

deals with such topics as File and Record Locking, Interprocess Com- 
munication, and programming terminal screens. 

■ Programming Support Tools 

covers the Common Object File Format, link editor directives, shared
libraries, Symbolic Debugger (sdb), and lint.

■ Project Control Tools 

includes some discussion of make and SCCS. 

The chapter concludes with a description of a sample application called 
liber that uses several of the components described in earlier portions of the 
chapter. 
The characteristics of the application programming environment that make
it different from single-user programming have at their base the need for 
interaction and for sharing of information. 



Numbers 

Perhaps the most obvious difference between application programming 
and single-user programming is in the quantities of the components. Not only 
are applications generally developed by teams of programmers, but the 
number of separate modules of code can grow into the hundreds on even a 
fairly simple application. 

When more than one programmer works on a project, there is a need to
share information such as the following:

■ the operation of each function

■ the number, identity, and type of arguments expected by a function

■ whether the objects pointed to by pointer arguments are modified by
the called function, and the lifetime of those objects

■ the data type returned by a function 

In an application, there is an odds-on possibility that the same function 
can be used in many different programs, by many different programmers. 
The object code needs to be kept in a library accessible to anyone on the pro- 
ject who needs it. 



Portability 

When you are working on a program to be used on a single model of a 
computer, your concerns about portability are minimal. In application 
development, on the other hand, a desirable objective often is to produce code 
that will run on many different UNIX System computers. Some of the things 
that affect portability will be touched on later in this chapter. 
Documentation

A single-user program has modest needs for documentation. There 
should be enough to remind the program's creator how to use it and what the 
intent was in portions of the code. 

On an application development project there is a significant need for two 
types of internal documentation: 

■ comments throughout the source code that enable successor programmers
to understand easily what is happening in the code. Applications
can be expected to have a useful life of 5 or more years and frequently
need to be modified during that time. It is not realistic to expect that
the same person who wrote the program will always be available to
make modifications. Even if that does happen, the comments will
make the maintenance job a lot easier.

■ hard-copy descriptions of functions should be available to all members 
of an application development team. Without them, it is difficult to 
keep track of available modules, which can result in the same function 
being written over again. 

Unless end-users have clear, readily available instructions on how to
install and use an application, they either will not do it at all (if that is an
option) or will do it improperly.

The microcomputer software industry has become ever more keenly aware 
of the importance of good end-user documentation. There are cases on record 
where the success of a software package has been attributed in large part to 
the fact that it had exceptionally good documentation. There are also cases 
where a pretty good piece of software was not widely used due to the inacces- 
sibility of its manuals. There appears to be no truth to the rumor that in one 
or two cases, end-users have thrown the software away and just read the 
manual. 
Project Management

Without effective project management, an application development project 
is in trouble. This subject will not be dealt with in this guide, except to men- 
tion the following three things that are vital functions of project management: 

■ tracking dependencies between modules of code 

■ dealing with change requests in a controlled way 

■ seeing that milestone dates are met. 
In this section we talk about some of the considerations that influence the
selection of programming languages and describe three of the special purpose 
languages that are part of the UNIX System environment. 



Influences 

In single-user programming the choice of language is often a matter of 
personal preference; a language is chosen because it is the one the program- 
mer feels most comfortable with. 

An additional set of considerations comes into play when making the 
same decision for an application development project. 

Is there an existing standard within the organization that should be 
observed? 

A firm may decide to emphasize one language because a good sup- 
ply of programmers is available who are familiar with it. 

Does one language have better facilities for handling the particular 
algorithm? 

One would like to see all language selection based on such objec- 
tive criteria, but it is often necessary to balance this against the 
skills of the organization. 

Is there an inherent compatibility between the language and the UNIX 
Operating System? 

This is sometimes the impetus behind selecting C for programs 
destined for a UNIX System machine. 

Are there existing tools that can be used? 

If parsing of input lines is an important phase of the application, 
perhaps a parser generator such as yacc should be employed to 
develop what the application needs. 
Does the application integrate other software into the whole package?

If, for example, a package is to be built around an existing database 
management system, there may be constraints on the variety of 
languages the database management system can accommodate. 

Special Purpose Languages 

The UNIX System contains a number of tools that can be included in the 
category of special purpose languages. Three that are especially interesting 
are awk, lex, and yacc. 

What awk Is Like 

The awk utility scans an ASCII input file record by record, looking for 
matches to specific patterns. When a match is found, an action is taken. Pat- 
terns and their accompanying actions are contained in a specification file 
referred to as the program. The program can be made up of a number of 
statements. However, since each statement has the potential for causing a 
complex action, most awk programs consist of only a few. The set of state- 
ments may include definitions of the pattern that separates one record from 
another (a newline character, for example) and definitions of what separates 
one field of a record from the next (white space, for example). It may also 
include actions to be performed before the first record of the input file is read, 
and other actions to be performed after the final record has been read. All 
statements in between are evaluated in order, for each record in the input file. 
To paraphrase the action of a simple awk program, it would go something 
like this: 

Look through the input file. 

Every time you see this specific pattern, do this action. 

A more complex awk program might be paraphrased like this: 

First do some initialization. 

Then, look through the input file. 

Every time you see this specific pattern, do this action. 

Every time you see this other pattern, do another action. 

After all the records have been read, do these final things. 
The directions for finding the patterns and for describing the actions can
get pretty complicated, but the essential idea is as simple as the two sets of 
statements above. 

One of the strong points of awk is that once you are familiar with the
language syntax, programs can be written very quickly. They do not always
run very fast, however, so they are seldom appropriate if you want to run the
same program repeatedly on large quantities of records. In such a case, it is
likely to be better to translate the program to a compiled language.

How awk Is Used 

One typical use of awk would be to extract information from a file and 
print it out in a report. Another might be to pull fields from records in an 
input file, arrange them in a different order, and pass the resulting rearranged 
data to a function that adds records to your database. There is an example of 
a use of awk in the sample application at the end of this chapter. 

Where to Find More Information 

The manual page for awk is in Section (1) of the User's/System
Administrator's Reference Manual. Chapter 4 of this guide contains a
description of the awk syntax and a number of examples showing ways in which
awk may be used.

What lex and yacc Are Like 

The utilities lex and yacc are often mentioned in the same breath because 
they perform complementary parts of what can be viewed as a single task, 
making sense out of input. The two utilities also share the common charac- 
teristic of producing source code for C language subroutines from specifica- 
tions that appear on the surface to be quite similar. 

Recognizing input is a recurring problem in programming. Input can be 
from various sources. In a language compiler, for example, the input is nor- 
mally contained in a file of source language statements. The UNIX System 
shell language most often receives its input from a person keying in com- 
mands from a terminal. Frequently, information coming out of one program is 
fed into another where it must be evaluated. 

The process of input recognition can be subdivided into two tasks, lexical 
analysis and parsing, and that is where lex and yacc come in. In both utili- 
ties, the specifications cause the generation of C language subroutines that 
deal with streams of characters; lex generates subroutines that do lexical 
analysis, while yacc generates subroutines that do parsing. 
To describe those two tasks in dictionary terms:

Lexical analysis has to do with identifying the words or vocabulary of 
a language as distinguished from its grammar or structure. 

Parsing is the act of describing units of the language grammatically. 
Students in elementary school are often taught to do this with sen- 
tence diagrams. 

Of course, the important thing to remember here is that in each case the 
rules for our lexical analysis or parsing are those we set down ourselves in the 
lex or yacc specifications. Because of this, the dividing line between lexical 
analysis and parsing sometimes becomes fuzzy. 

The fact that lex and yacc produce C language source code means that 
these parts of what may be a large programming project can be separately 
maintained. The generated source code is processed by the C compiler to pro- 
duce an object file. The object file can be link edited with others to produce 
programs that then perform whatever process follows from the recognition of 
the input. 

How lex Is Used 

A lex subroutine scans a stream of input characters and waves a flag each 
time it identifies something that matches one or another of its rules. The 
waved flag is referred to as a token. The rules are stated in a format that 
closely resembles the one used by the UNIX System text editor for regular 
expressions. For example, 

[ \t]+

describes a rule that recognizes a string of one or more blanks or tabs (without 
mentioning any action to be taken). A more complete statement of that rule 
might have the following notation: 

[ \t]+ ; 

which, in effect, says to ignore white space. It carries this meaning because 
no action is specified when a string of one or more blanks or tabs is
recognized. The semicolon marks the end of the statement.

Another rule, one that does take some action, could be stated like this:

[0-9]+  {
        i = atoi(yytext);
        return(NBR);
        }

This rule depends on several things: 

NBR must have been defined as a token in an earlier part of the lex 
source code called the declaration section. (It may be in a header file 
which is #include'd in the declaration section.) 

i is declared as an extern int in the declaration section. 

It is a characteristic of lex that things it finds are made available in a 
character string called yytext. 

Actions can make use of standard C syntax. Here, the standard C 
subroutine, atoi, is used to convert the string to an integer. 

What this rule boils down to is lex saying, "Hey, I found the kind of
token we call NBR, and its value is now in i."

To review the steps of the process: 

1. The lex specification statements are processed by the lex utility to
   produce a file called lex.yy.c. (This is the standard name for a file
   generated by lex, just as a.out is the standard name for the executable
   file generated by the link editor.)

2. lex.yy.c is transformed by the C compiler (with a -c option) into an
   object file called lex.yy.o that contains a subroutine called yylex().

3. lex.yy.o is link edited with other subroutines. Presumably one of
   those subroutines will call yylex() with a statement such as

        while ((token = yylex()) != 0)

   and other subroutines (or even main) will deal with what comes back.

Where to Find More Information

The manual page for lex is in Section (1) of the Programmer's Reference 
Manual. A tutorial on lex is contained in Chapter 5 of this guide. 

How yacc Is Used 

The yacc subroutines are produced by pretty much the same series of 
steps as lex. 

1. The yacc specification is processed by the yacc utility to produce a file
   called y.tab.c.

2. y.tab.c is compiled by the C compiler producing an object file, y.tab.o,
   that contains the subroutine yyparse(). A significant difference is that
   yyparse() calls a subroutine called yylex() to perform lexical analysis.

3. The object file y.tab.o may be link edited with other subroutines, one
   of which will be called yylex().

There are two things worth noting about this sequence: 

1. The parser generated by the yacc specifications calls a lexical analyzer
   to scan the input stream and return tokens.

2. While the lexical analyzer is called by the same name as one produced 
by lex, it does not have to be the product of a lex specification. It can 
be any subroutine that does the lexical analysis. 
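The second point can be illustrated with a hand-written lexical analyzer. What follows is only a minimal sketch, not the output of lex: the token code 257 for NBR, the string used as the input source, and the helper set_input are all assumptions made for the illustration. The routine mimics the two lex rules shown earlier: white space is ignored, and a string of digits comes back as an NBR token with its value in i.

```c
#include <ctype.h>
#include <stdlib.h>

#define NBR 257                  /* hypothetical token code; yacc would normally assign it */

int i;                           /* value of the most recent NBR token */
static const char *input = "";   /* hypothetical input source: a C string */

/* Point the scanner at a new input string (illustrative helper). */
void set_input(const char *s)
{
    input = s;
}

/* Hand-written stand-in for a lex-generated yylex(). */
int yylex(void)
{
    char text[32];
    int n = 0;

    while (*input == ' ' || *input == '\t')    /* [ \t]+ ;   ignore white space */
        input++;

    if (*input == '\0')
        return 0;                              /* 0 tells the caller the input is used up */

    if (isdigit((unsigned char)*input)) {      /* [0-9]+ */
        /* digit strings longer than the buffer are split into several tokens */
        while (isdigit((unsigned char)*input) && n < (int)sizeof(text) - 1)
            text[n++] = *input++;
        text[n] = '\0';
        i = atoi(text);
        return NBR;
    }

    return (unsigned char)*input++;            /* any other character is its own token */
}
```

A caller would use the same loop as with a lex-generated routine: call set_input once, then call yylex() repeatedly until it returns 0.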

What really differentiates these two utilities is the format for their rules.
As noted above, lex rules are regular expressions like those used by UNIX
System editors. yacc rules are chains of definitions and alternative
definitions, written in Backus-Naur form, accompanied by actions. The rules
may refer to other rules defined further down the specification. Actions are
sequences of C language statements enclosed in braces. They frequently contain
numbered variables that enable you to reference values associated with parts
of the rules.

An example might make that easier to understand.




%token  NUMBER
%%
expr    : numb                  { $$ = $1; }
        | expr '+' expr         { $$ = $1 + $3; }
        | expr '-' expr         { $$ = $1 - $3; }
        | expr '*' expr         { $$ = $1 * $3; }
        | expr '/' expr         { $$ = $1 / $3; }
        | '(' expr ')'          { $$ = $2; }
        ;
numb    : NUMBER                { $$ = $1; }
        ;




This fragment of a yacc specification shows 

■ NUMBER identified as a token in the declaration section 

■ the start of the rules section indicated by the pair of percent signs 

■ a number of alternate definitions for expr separated by the | sign and
terminated by the semicolon

■ actions to be taken when a rule is matched 

■ within actions, numbered variables used to represent components of 
the rule: 

$$ means the value to be returned as the value of the whole rule 

$n means the value associated with the nth component of the rule, 
counting from the left 

■ numb defined as meaning the token NUMBER. This is a trivial exam- 
ple that illustrates that one rule can be referenced within another, as 
well as within itself. 

As with lex, the compiled yacc object file will generally be link edited with 
other subroutines that handle processing that takes place after the parsing — or 
even ahead of it.

Where to Find More Information

The manual page for yacc is in Section (1) of the Programmer's Reference
Manual. A detailed description of yacc may be found in Chapter 6 of this
guide.

In Chapter 2 we described the use of such basic elements of programming
in the UNIX System environment as the standard I/O library, header files,
system calls, and subroutines. In this section we introduce tools that are more 
apt to be used by members of an application development team than by a 
single-user programmer. The section contains material on the following 
topics: 

■ memory management 

■ file and record locking 

■ interprocess communication 

■ programming terminal screens. 



Memory Management 

There are situations where a program needs to ask the operating system 
for blocks of memory. It may be, for example, that a number of records have 
been extracted from a database and need to be held for some further process- 
ing. Rather than writing them out to a file on secondary storage and then 
reading them back in again, it is likely to be a great deal more efficient to hold 
them in memory for the duration of the process. (This is not to ignore the
possibility that portions of memory may be paged out before the program is
finished; but such an occurrence is not pertinent to this discussion.) There are
two C language subroutines available for acquiring blocks of memory, and
they are both called malloc. One of them is malloc(3C), the other is
malloc(3X). Each has several related functions that do specialized tasks in the
same area. They are

■ free — to inform the system that space is being relinquished 

■ realloc — to change the size and possibly move the block 

■ calloc — to allocate space for an array and initialize it to zeros 

In addition, malloc(3X) has a function, mallopt, that provides for control 
over the space allocation algorithm, and a structure, mallinfo, from which the 
program can get information about the usage of the allocated space.

malloc(3X) runs faster than the other version. It is loaded by specifying

        -lmalloc

on the cc(1) or ld(1) command line to direct the link editor to the proper
library. When you use malloc(3X), your program should contain the statement

        #include <malloc.h>

where the values for mallopt options are defined. 

See the Programmer's Reference Manual for the formal definitions of the 
two mallocs. 
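To make the database example concrete, here is a minimal sketch of holding extracted records in memory with calloc, realloc, and free. It uses only the malloc(3C) family from the standard C library. The record layout and the names table_init, table_add, and table_free are invented for the illustration and are not part of any system library.

```c
#include <stdlib.h>
#include <string.h>

struct record {                  /* hypothetical record layout */
    long key;
    char name[20];
};

struct table {
    struct record *rows;         /* block obtained from calloc/realloc */
    size_t used;                 /* rows currently filled in */
    size_t room;                 /* rows the block can hold */
};

/* Start with a zero-filled block big enough for `room` records (room >= 1). */
int table_init(struct table *t, size_t room)
{
    t->rows = calloc(room, sizeof(struct record));
    t->used = 0;
    t->room = (t->rows != NULL) ? room : 0;
    return (t->rows != NULL) ? 0 : -1;
}

/* Append one record, doubling the block with realloc when it is full. */
int table_add(struct table *t, long key, const char *name)
{
    if (t->used == t->room) {
        size_t bigger = (t->room > 0) ? t->room * 2 : 4;
        struct record *p = realloc(t->rows, bigger * sizeof(struct record));

        if (p == NULL)
            return -1;           /* the old block is still valid */
        t->rows = p;             /* realloc may have moved the block */
        t->room = bigger;
    }
    t->rows[t->used].key = key;
    strncpy(t->rows[t->used].name, name, sizeof(t->rows[0].name) - 1);
    t->rows[t->used].name[sizeof(t->rows[0].name) - 1] = '\0';
    t->used++;
    return 0;
}

/* Tell the allocator that the space is being relinquished. */
void table_free(struct table *t)
{
    free(t->rows);
    t->rows = NULL;
    t->used = t->room = 0;
}
```

Because realloc may move the block, the sketch never keeps a pointer into rows across a call to table_add; callers should index through t->rows each time.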



File and Record Locking 

The provision for locking files, or portions of files, is primarily used to 
prevent the sort of error that can occur when two or more users of a file try to 
update information at the same time. The classic example is the airlines reser- 
vation system where two ticket agents each assign a passenger to Seat A, 
Row 5 on the 5 o'clock flight to Detroit. A locking mechanism is designed to 
prevent such mishaps by blocking Agent B from even seeing the seat assign- 
ment file until Agent A's transaction is complete. 

File locking and record locking are really the same thing, except that file 
locking implies the whole file is affected, and record locking means that only 
a specified portion of the file is locked. (Remember, in the UNIX System, file 
structure is undefined; a record is a concept of the programs that use the file.) 

Two types of locks are available: read locks and write locks. If a process 
places a read lock on a file, other processes can also read the file, but all are 
prevented from writing to it, that is, changing any of the data. If a process 
places a write lock on a file, no other processes can read or write in the file 
until the lock is removed. Write locks are also known as exclusive locks. The 
term shared lock is sometimes applied to read locks. 

Another distinction needs to be made between mandatory and advisory 
locking. Mandatory locking means that the discipline is enforced automati- 
cally for the system calls that read, write, or create files. This is done through 
a permission flag established by the file's owner (or the super-user). Advisory 
locking means that the processes that use the file take the responsibility for 
setting and removing locks as needed. Thus mandatory may sound like a 
simpler and better deal, but it is not so. The mandatory locking capability is
included in the system to comply with an agreement with /usr/group, an
organization that represents the interests of UNIX System users. The principal 
weakness in the mandatory method is that the lock is in place only while the 
single system call is being made. It is extremely common for a single transac- 
tion to require a series of reads and writes before it can be considered com- 
plete. In cases like this, the term atomic is used to describe a transaction that 
must be viewed as an indivisible unit. The preferred way to manage locking 
in such a circumstance is to make certain the lock is in place before any I/O 
starts, and that it is not removed until the transaction is done. That calls for 
locking of the advisory variety. 
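The shape of such a transaction can be sketched with the fcntl(2) call described in the next section. This is an illustration under stated assumptions, not a prescribed method: the function names whole_file_lock and bump_counter are invented here, and the "transaction" is simply adding 1 to a long stored at the front of a file. The point to notice is that the write lock is obtained before the first read and released only after the last write.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Lock or unlock the whole file; type is F_RDLCK, F_WRLCK, or F_UNLCK. */
static int whole_file_lock(int fd, short type)
{
    struct flock lk;

    memset(&lk, 0, sizeof lk);
    lk.l_type = type;
    lk.l_whence = SEEK_SET;            /* l_start counts from the start of the file */
    lk.l_start = 0;
    lk.l_len = 0;                      /* 0 means through the end of the file */
    return fcntl(fd, F_SETLKW, &lk);   /* wait if another process holds the lock */
}

/*
 * Add 1 to a long stored at the front of the file, as one transaction.
 * Returns the new value, or -1 on error.
 */
long bump_counter(int fd)
{
    long n = 0;
    long result = -1;
    ssize_t got;

    if (whole_file_lock(fd, F_WRLCK) == -1)    /* lock before any I/O starts */
        return -1;

    if (lseek(fd, 0L, SEEK_SET) != -1) {
        got = read(fd, &n, sizeof n);
        if (got == 0)
            n = 0;                             /* empty file: start from zero */
        if ((got == 0 || got == (ssize_t)sizeof n) &&
            lseek(fd, 0L, SEEK_SET) != -1) {
            n++;
            if (write(fd, &n, sizeof n) == (ssize_t)sizeof n)
                result = n;
        }
    }

    whole_file_lock(fd, F_UNLCK);              /* release only when the transaction is done */
    return result;
}
```

The protection is advisory: it works only if every program that updates the file goes through the same locking discipline.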

How File and Record Locking Works

The system call for file and record locking is fcntl(2). Programs should 
include the line 

#include <fcntl.h> 

to bring in the header file shown in Figure 3-1.

/* Flag values accessible to open(2) and fcntl(2) */
/* (The first three can only be set by open) */
#define O_RDONLY  0
#define O_WRONLY  1
#define O_RDWR    2
#define O_NDELAY  04      /* Non-blocking I/O */
#define O_APPEND  010     /* append (writes guaranteed at the end) */
#define O_SYNC    020     /* synchronous write option */

/* Flag values accessible only to open(2) */
#define O_CREAT   00400   /* open with file create (uses third open arg) */
#define O_TRUNC   01000   /* open with truncation */
#define O_EXCL    02000   /* exclusive open */

/* fcntl(2) requests */
#define F_DUPFD   0       /* Duplicate fildes */
#define F_GETFD   1       /* Get fildes flags */
#define F_SETFD   2       /* Set fildes flags */
#define F_GETFL   3       /* Get file flags */
#define F_SETFL   4       /* Set file flags */
#define F_GETLK   5       /* Get file lock */
#define F_SETLK   6       /* Set file lock */
#define F_SETLKW  7       /* Set file lock and wait */
#define F_CHKFL   8       /* Check legality of file flag changes */

/* file segment locking set data type - information passed to system by user */
struct flock {
        short   l_type;
        short   l_whence;
        long    l_start;
        long    l_len;    /* len = 0 means until end of file */
        short   l_sysid;
        short   l_pid;
};

/* file segment locking types */
#define F_RDLCK   01      /* Read lock */
#define F_WRLCK   02      /* Write lock */
#define F_UNLCK   03      /* Remove lock(s) */

Figure 3-1: The fcntl.h Header File

The format of the fcntl(2) system call is

        int fcntl(fildes, cmd, arg)
        int fildes, cmd, arg;

fildes is the file descriptor returned by the open system call. In addition to
defining tags that are used as the commands on fcntl system calls, fcntl.h
includes the declaration for a struct flock that is used to pass values that
control where locks are to be placed.
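The following sketch shows how struct flock describes a section of a file, and how read and write locks behave when a second process asks for the same bytes. It is an illustration only: the function names describe and child_can_lock are invented, and the locked section (bytes 0 through 9) is arbitrary. The parent sets a lock with F_SETLK, then a child process created with fork requests its own lock on the same section without waiting.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fill in a struct flock describing `len` bytes starting at offset `start`. */
static void describe(struct flock *lk, short type, long start, long len)
{
    memset(lk, 0, sizeof *lk);
    lk->l_type = type;           /* F_RDLCK, F_WRLCK, or F_UNLCK */
    lk->l_whence = SEEK_SET;     /* l_start counts from the beginning of the file */
    lk->l_start = start;
    lk->l_len = len;
}

/*
 * Hold a lock of type `held` on bytes 0-9 in the parent, then have a child
 * process ask, without waiting, for a lock of type `wanted` on the same
 * bytes.  Returns 1 if the child got its lock, 0 if it was refused, and
 * -1 on error.
 */
int child_can_lock(const char *path, short held, short wanted)
{
    struct flock lk;
    int status;
    pid_t pid;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1)
        return -1;
    describe(&lk, held, 0L, 10L);
    if (fcntl(fd, F_SETLK, &lk) == -1) {
        close(fd);
        return -1;
    }

    pid = fork();
    if (pid == 0) {                          /* child: the parent's lock is not inherited */
        int cfd = open(path, O_RDWR);

        describe(&lk, wanted, 0L, 10L);
        _exit((cfd != -1 && fcntl(cfd, F_SETLK, &lk) != -1) ? 0 : 1);
    }
    if (pid == -1 || waitpid(pid, &status, 0) == -1) {
        close(fd);
        return -1;
    }

    close(fd);                               /* closing the file releases the parent's lock */
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

With a read lock held, a second read lock is granted but a write lock is refused; with a write lock held, both kinds of request are refused. Using F_SETLKW in place of F_SETLK would make the child wait instead of failing.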

lockf 

A subroutine, lockf(3), can also be used to lock sections of a file or an
entire file. The format of lockf is

        #include <unistd.h>

        int lockf(fildes, function, size)
        int fildes, function;
        long size;

fildes is the file descriptor; function is one of four control values defined in
unistd.h that let you lock, unlock, test and lock, or simply test to see if a lock
is already in place. size is the number of contiguous bytes to be locked or
unlocked. The section of contiguous bytes can be either forward or backward 
from the current offset in the file. [You can arrange to be somewhere in the 
middle of the file by using the lseek(2) system call.] 

Where to Find More Information 

There is an example of file and record locking in the sample application at 
the end of this chapter. The manual pages that apply to this facility are 
fcntl(2), fcntl(5), lockf(3), and chmod(2) in the Programmer's Reference
Manual. Chapter 7 of this guide is a detailed discussion of the subject with a
number of examples. 



Interprocess Communications 

In Chapter 2 we described forking and execing as methods of communi- 
cating between processes. Business applications running on a UNIX System 
computer often need more sophisticated methods. In applications, for exam- 
ple, where fast response is critical, a number of processes may be brought up 
at the start of a business day to be constantly available to handle transactions
on demand. This cuts out initialization time that can add seconds to the time
required to deal with the transaction. To go back to the ticket reservation
example again for a moment, if a customer calls to reserve a seat on the
5 o'clock flight to Detroit, you do not want to have to say, "Yes, sir. Just
hang on a minute while I start up the reservations program." In transaction-driven
systems, the normal mode of processing is to have all the components
of the application standing by waiting for some sort of an indication that there 
is work to do. 

To meet requirements of this type the UNIX System offers a set of nine 
system calls and their accompanying header files, all under the umbrella name 
of Interprocess Communications (IPC). 

The IPC system calls come in sets of three; one set each for messages, 
semaphores, and shared memory. These three terms define three different 
styles of communication between processes: 

messages         communication is in the form of data stored in a buffer.
                 The buffer can be either sent or received.

semaphores       communication is in the form of positive integers with a
                 value between 0 and 32,767. Semaphores may be contained
                 in an array the size of which is determined by the
                 system administrator. The default maximum size for the
                 array is 25.

shared memory    communication takes place through a common area of
                 main memory. One or more processes can attach a segment
                 of memory and as a consequence can share whatever
                 data is placed there.

The sets of IPC system calls are 

        msgget          semget          shmget
        msgctl          semctl          shmctl
        msgop           semop           shmop



IPC get Calls 

The get calls each return to the calling program an identifier for the type 
of IPC facility that is being requested.

IPC ctl Calls

The ctl calls provide a variety of control operations that include obtaining
(IPC_STAT), setting (IPC_SET), and removing (IPC_RMID) the values in
data structures associated with the identifiers picked up by the get calls.

IPC op Calls 

The op manual pages describe calls that are used to perform the particular
operations characteristic of the type of IPC facility being used. msgop has
calls that send or receive messages. semop (the only one of the three that is
actually the name of a system call) is used to increment or decrement the
value of a semaphore, among other functions. shmop has calls that attach or
detach shared memory segments.
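The way the get, op, and ctl calls fit together can be sketched for the message facility. This is a minimal illustration, not a recommended design: the message layout struct request and the wrapper names queue_create, queue_send, queue_receive, queue_depth, and queue_remove are invented here. The underlying calls are msgget, the msgsnd and msgrcv calls documented under msgop, and msgctl.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct request {                 /* hypothetical message layout */
    long mtype;                  /* message type; must be greater than 0 */
    char text[64];
};

/* get: create a private message queue and return its identifier. */
int queue_create(void)
{
    return msgget(IPC_PRIVATE, IPC_CREAT | 0600);
}

/* op: send one message of the given type. */
int queue_send(int qid, long type, const char *text)
{
    struct request r;

    r.mtype = type;
    strncpy(r.text, text, sizeof r.text - 1);
    r.text[sizeof r.text - 1] = '\0';
    return msgsnd(qid, &r, sizeof r.text, 0);
}

/* op: receive the first message of the given type, waiting if there is none. */
int queue_receive(int qid, long type, char *out, size_t outsize)
{
    struct request r;

    if (outsize == 0 || msgrcv(qid, &r, sizeof r.text, type, 0) == -1)
        return -1;
    strncpy(out, r.text, outsize - 1);
    out[outsize - 1] = '\0';
    return 0;
}

/* ctl: report how many messages are waiting on the queue (IPC_STAT). */
long queue_depth(int qid)
{
    struct msqid_ds ds;

    if (msgctl(qid, IPC_STAT, &ds) == -1)
        return -1;
    return (long)ds.msg_qnum;
}

/* ctl: remove the queue from the system (IPC_RMID). */
int queue_remove(int qid)
{
    return msgctl(qid, IPC_RMID, (struct msqid_ds *)0);
}
```

In a real application the sender and the receiver would be separate processes that agree on a key instead of using IPC_PRIVATE. A queue stays in the system until it is removed, so the IPC_RMID step should not be skipped.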

Where to Find More Information 

An example of the use of some IPC features is included in the sample 
application at the end of this chapter. The system calls are all located in
Section (2) of the Programmer's Reference Manual. Do not overlook intro(2). It
includes descriptions of the data structures that are used by IPC facilities. A 
detailed description of IPC, with many code examples that use the IPC system 
calls, is contained in Chapter 9 of this guide. 



Programming Terminal Screens 

The facility for setting up terminal screens to meet the needs of your 
application is provided by two parts of the UNIX System. The first of these, 
terminfo, is a database of compiled entries that describe the capabilities of ter- 
minals and the way they perform various operations. 

The terminfo database normally begins at the /usr/lib/terminfo directory.
Members of this directory are themselves directories, generally with
single-character names that are the first character in the name of the terminal. 
The compiled files of operating characteristics are at the next level down the 
hierarchy. For example, the entry for a Teletype 5425 is located in both the 
file /usr/lib/terminfo/5/5425 and the file /usr/lib/terminfo/t/tty5425. 

Describing the capabilities of a terminal can be a painstaking task. Quite 
a good selection of terminal entries is included in the terminfo database that 
comes with your computer. However, if you have a type of terminal that is 
not described in the database, the best way to proceed is to find a description
of one that comes close to having the same capabilities as yours and build
on that one. There is a routine (setupterm) in curses(3X) that can be used to
print out descriptions from the database. Once you have worked out the code
that describes the capabilities of your terminal, the tic(1M) command is used
to compile the entry and add it to the database.

curses 

After you have made sure that the operating capabilities of your terminal 
are a part of the terminfo database, you can then proceed to use the routines 
that make up the curses(3X) package to create and manage screens for your
application. 

The curses library includes functions to do the following: 

■ define portions of your terminal screen as windows 

■ define pads that extend beyond the borders of your physical terminal 
screen and let you see portions of the pad on your terminal 

■ read input from a terminal screen into a program 

■ write output from a program to your terminal screen 

■ manipulate the information in a window in a virtual screen area and 
then send it to your physical screen 

Where to Find More Information 

In the sample application at the end of this chapter, we show how you 
might use curses routines. Chapter 10 of this guide contains a tutorial on the 
subject. The manual pages for curses are in Section (3X), and those for
terminfo are in Section (4) of the Programmer's Reference Manual.

This section covers UNIX System components that are part of the
programming environment, but that have a highly specialized use. We refer to
such things as the following: 

■ link edit command language 

■ Common Object File Format 

■ libraries 

■ Symbolic Debugger 

■ lint as a portability tool 



Link Editor Command Language 

The link editor command language is for use when the default arrangement
of the ld output will not do the job. The default locations for the standard
Common Object File Format sections are described in a.out(4) in the
Programmer's Reference Manual.

On an 80386 Computer, when an a.out file is loaded into memory for 
execution, the text segment starts at location 0x0, and the data section starts at 
the next segment boundary after the end of the text. The stack begins at 
OxBFFFFFFF and grows to lower memory addresses. 

The link editor command language provides directives for describing dif- 
ferent arrangements. The two major types of link editor directives are 
MEMORY and SECTIONS. MEMORY directives can be used to define the 
boundaries of configured and unconfigured sections of memory within a 
machine, to name sections, and to assign specific attributes (read, write, exe- 
cute, and initialize) to portions of memory. SECTIONS directives, among a lot 
of other functions, can be used to bind sections of the object file to specific 
addresses within the configured portions of memory.

The need to control the link editor output becomes more urgent under
two, possibly related, sets of circumstances.

1. Your application is large and consists of a lot of object files.

2. The hardware your application is to run on is tight for space. 

Where to Find More Information 

Chapter 12 of this guide gives a detailed description of the subject. 

Common Object File Format 

A knowledge of COFF is fundamental to using the link editor command 
language. It is also good background knowledge for tasks such as the follow- 
ing: 

■ setting up archive libraries or shared libraries 

■ using the Symbolic Debugger 

The following system header files contain definitions of data structures of
parts of the Common Object File Format:

        <syms.h>        symbol table format
        <linenum.h>     line number entries
        <ldfcn.h>       COFF access routines
        <filehdr.h>     file header for a common object file
        <a.out.h>       common assembler and link editor output
        <scnhdr.h>      section header for a common object file
        <reloc.h>       relocation information for a common object file
        <storclass.h>   storage classes for common object files

The object file access routines are described below under the heading 
"The Object File Library." 

Where to Find More Information

Chapter 11 of this guide gives a detailed description of COFF.

Libraries

A library is a collection of related object files and/or declarations that sim- 
plify programming effort. Programming groups involved in the development 
of applications often find it convenient to establish private libraries. For 
example, an application with a number of programs using a common database 
can keep the I/O routines in a library that is searched at link edit time. 

Prior to Release 3.0 of the UNIX System V, the libraries, whether system 
supplied or application developed, were collections of common object format 
files stored in an archive (filename.a) file that was searched by the link editor
to resolve references. Files in the archive that were needed to satisfy 
unresolved references became a part of the resulting executable. 

Beginning with Release 3.0, shared libraries are supported. Shared 
libraries are similar to archive libraries in that they are collections of object 
files that are acted upon by the link editor. The difference, however, is that 
shared libraries perform a static linking between the file in the library and the 
executable that is the output of ld. The result is a saving of space because all
executables that need a file from the library share a single copy. We go into 
shared libraries later in this section. 

In Chapter 2 we described many of the functions that are found in the 
standard C library, libc.a. The next two sections describe two other libraries, 
the object file library and the math library. 

The Object File Library 

The object file library provides functions for the access and manipulation 
of object files. Some functions locate portions of an object file such as the 
symbol table, the file header, sections, and line number entries associated with 
a function. Other functions read these types of entries into memory. The 
need to work at this level of detail with object files occurs most often in the 
development of new tools that manipulate object files. For a description of 
the format of an object file, see "The Common Object File Format" in 
Chapter 11. This library consists of several portions.

The functions (see Figure 3-2) reside in /lib/libld.a and are loaded during the
compilation of a C language program by the -l command line option

        cc file -lld

which causes the link editor to search the object file library. The argument
-lld must appear after all files that reference functions in libld.a.

The following header files must be included in the source code.

        #include <stdio.h>
        #include <a.out.h>
        #include <ldfcn.h>

Function     Reference       Brief Description

ldaclose     ldclose(3X)     closes object file being processed

ldahread     ldahread(3X)    reads archive header

ldaopen      ldopen(3X)      opens object file for reading

ldclose      ldclose(3X)     closes object file being processed

ldfhread     ldfhread(3X)    reads file header of object file being
                             processed

ldgetname    ldgetname(3X)   retrieves the name of an object file
                             symbol table entry

ldlinit      ldlread(3X)     prepares object file for reading line
                             number entries via ldlitem

ldlitem      ldlread(3X)     reads line number entry from object file
                             after ldlinit

ldlread      ldlread(3X)     reads line number entry from object file

ldlseek      ldlseek(3X)     seeks to the line number entries of the
                             object file being processed

ldnlseek     ldlseek(3X)     seeks to the line number entries of the
                             object file being processed given the
                             name of a section

ldnrseek     ldrseek(3X)     seeks to the relocation entries of the
                             object file being processed given the
                             name of a section

ldnshread    ldshread(3X)    reads section header of the named
                             section of the object file being processed

ldnsseek     ldsseek(3X)     seeks to the section of the object file
                             being processed given the name of a
                             section

ldohseek     ldohseek(3X)    seeks to the optional file header of the
                             object file being processed

ldopen       ldopen(3X)      opens object file for reading

ldrseek      ldrseek(3X)     seeks to the relocation entries of the
                             object file being processed

ldshread     ldshread(3X)    reads section header of an object file
                             being processed

ldsseek      ldsseek(3X)     seeks to the section of the object file
                             being processed

ldtbindex    ldtbindex(3X)   returns the long index of the symbol
                             table entry at the current position of the
                             object file being processed

ldtbread     ldtbread(3X)    reads a specific symbol table entry of
                             the object file being processed

ldtbseek     ldtbseek(3X)    seeks to the symbol table of the object
                             file being processed

sgetl        sputl(3X)       accesses long integer data in a
                             machine-independent format

sputl        sputl(3X)       translates a long integer into a
                             machine-independent format

Figure 3-2: Object File Library Functions






Common Object File Interface Macros (ldfcn.h) 

The interface between the calling program and the object file access rou- 
tines is based on the defined type LDFILE, which is in the header file ldfcn.h 
[see ldfcn(4)]. The primary purpose of this structure is to provide uniform 
access to both simple object files and to object files that are members of an 
archive file. 

The function ldopen(3X) allocates and initializes the LDFILE structure and 
returns a pointer to the structure. The fields of the LDFILE structure can be 
accessed individually through the following macros: 

■ TYPE — returns the magic number of the file, which is used to distin- 
guish between archive files and object files that are not part of an 
archive. 

■ IOPTR — returns the file pointer, which was opened by ldopen(3X) and
is used by the input/output functions of the C library. 

■ OFFSET — returns the file address of the beginning of the object file. 
This value is non-zero only if the object file is a member of the archive 
file. 

■ HEADER — accesses the file header structure of the object file. 

Additional macros are provided to access an object file. These macros 
parallel the input/output functions in the C library; each macro translates a 
reference to an LDFILE structure into a reference to its file descriptor field. 
The available macros are described in ldfcn(4) in the Programmer's Reference
Manual.

The Math Library 

The math library package consists of functions and a header file. The 
functions are located and loaded during the compilation of a C language
program by the -l option on a command line, as follows:

        cc file -lm

This option causes the link editor to search the math library, libm.a. In
addition to the request to load the functions, the header file of the math
library should be included in the program being compiled. This is
accomplished by including the line

        #include <math.h>

near the beginning of each file that uses the routines.

The functions are grouped into the following categories: 

■ trigonometric functions 

■ Bessel functions 

■ hyperbolic functions 

■ miscellaneous functions 

Trigonometric Functions 

These functions are used to compute angles (in radian measure), sines, 
cosines, and tangents. All of these values are expressed in double-precision. 



Function    Reference    Brief Description

acos        trig(3M)     returns arc cosine

asin        trig(3M)     returns arc sine

atan        trig(3M)     returns arc tangent

atan2       trig(3M)     returns arc tangent of a ratio

cos         trig(3M)     returns cosine

sin         trig(3M)     returns sine

tan         trig(3M)     returns tangent



Bessel Functions 

These functions calculate Bessel functions of the first and second kinds of 
several orders for real values. The Bessel functions are j0, j1, jn, y0, y1, and
yn. The functions are described in section 3 [bessel(3M)] of the Programmer's
Reference Manual.

Hyperbolic Functions 

These functions are used to compute the hyperbolic sine, cosine, and 
tangent for real values.

Function    Reference    Brief Description

cosh        sinh(3M)     returns hyperbolic cosine

sinh        sinh(3M)     returns hyperbolic sine

tanh        sinh(3M)     returns hyperbolic tangent



Miscellaneous Functions 

These functions cover a wide variety of operations, such as natural loga- 
rithm, exponential and absolute value. In addition, several are provided to 
truncate the integer portion of double-precision numbers. 
Function    Reference      Brief Description

ceil        floor(3M)      returns the smallest integer not less
                           than a given value

exp         exp(3M)        returns the exponential function of a
                           given value

fabs        floor(3M)      returns the absolute value of a given
                           value

floor       floor(3M)      returns the largest integer not greater
                           than a given value

fmod        floor(3M)      returns the remainder produced by the
                           division of two given values

gamma       gamma(3M)      returns the natural log of the absolute
                           value of the result of applying the
                           gamma function to a given value

hypot       hypot(3M)      returns the square root of the sum of
                           the squares of two numbers

log         exp(3M)        returns the natural logarithm of a given
                           value

log10       exp(3M)        returns the logarithm base ten of a
                           given value

matherr     matherr(3M)    error-handling function

pow         exp(3M)        returns the result of a given value
                           raised to another given value

sqrt        exp(3M)        returns the square root of a given value


Shared Libraries 

As noted above, beginning with UNIX System V Release 3.0, shared 
libraries are supported. Not only are some system libraries (libc and the net- 
working library) available in both archive and shared library form, but also 
applications have the option of creating private application shared libraries. 
Shared libraries are desirable because they save space, both
on disk and in memory. With an archive library, when the link editor goes to
the archive to resolve a reference, it takes a copy of the object file that it
needs for the resolution and binds it into the a.out file. From that point on
the copied file is a part of the executable, whether it is in memory to be run or
sitting in secondary storage. If you have a lot of executables that use, say,
printf (which just happens to require much of the standard I/O library) you
can be talking about a sizeable amount of space.

With a shared library, the link editor does not copy code into the execut- 
able files. When the operating system starts a process that uses a shared 
library, it maps the shared library contents into the address space of the pro- 
cess. Only one copy of the shared code exists, and many processes can use it 
at the same time. 

This fundamental difference between archives and shared libraries has
another significant aspect. When code in an archive library is modified, all
existing executables are unaffected. They continue using the older version
until they are re-link edited. When code in a shared library is modified, all
programs that share that code use the new version the next time they are
executed.

All this may sound like a really terrific deal, but as with most things in 
life there are complications. To begin with, in the paragraphs above we did 
not give you quite all the facts. For example, each process that uses shared 
library code gets its own copy of the entire data region of the library. It is 
actually only the text region that is really shared. So the truth is that shared 
libraries can add space to executing a.out's, even though the chances are good 
that they will cause more shrinkage than expansion. What this means is that 
when there is a choice between using a shared library and an archive, you 
should not use the shared library unless it saves space. If you were using a 
shared libc to access only strcmp, for example, you would pick up more in 
shared library data than you would save by sharing the text. 

The answer to this problem, and to others that are somewhat more com- 
plex, is to assign the responsibility for shared libraries to a central person or 
group within the application. The shared library developer should be the one 
to resolve questions of when to use shared and when to use archive system 
libraries. If a private library is to be built for your application, one person or 
organization should be responsible for its development and maintenance. 
Where to Find More Information

The sample application at the end of this chapter includes an example of 
the use of a shared library. Chapter 8 of this guide describes how shared 
libraries are built and maintained. 



Symbolic Debugger 

The use of sdb was mentioned briefly in Chapter 2. In this section we 
want to say a few words about sdb within the context of an application 
development project. 

sdb works on a process, and enables a programmer to find errors in the
code. It is a tool a programmer might use while coding and unit testing a
program, to make sure it runs according to its design. sdb would normally be
used prior to the time the program is turned over, along with the rest of the
application, to testers. During this phase of the application development
cycle, programs are compiled with the -g option of cc to facilitate the use of
the debugger. The symbol table should not be stripped from the object file.
Once the programmer is satisfied that the program is error-free, strip(1) can
be used to reduce the file storage overhead taken by the file.

If the application uses a private shared library, the possibility arises that a
program bug may be located in a file that resides in the shared library.
Dealing with a problem of this sort calls for coordination by the administrator of
the shared library. Any change to an object file that is part of a shared library
means the change affects all processes that use that file. One program's bug
may be another program's feature.

Where to Find More Information

Chapter 15 of this guide contains information on how to use sdb. The
manual page is in Section (1) of the Programmer's Reference Manual.



lint as a Portability Tool 

It is a characteristic of the UNIX System that language compilation sys- 
tems are somewhat permissive. Generally speaking, it is a design objective 
that a compiler should run fast. Most C compilers, therefore, let some things 
go unflagged as long as the language syntax is observed statement by state- 
ment. This sometimes means that while your program may run, the output 
will have some surprises. It also sometimes means that while the program
may run on the machine on which the compilation system runs, there may be 
real difficulties in running it on some other machine. 

That is where lint comes in. lint produces comments about inconsisten- 
cies in the code. The types of anomalies flagged by lint are as follows: 

■ cases of disagreement between the type of value expected from a 
called function and what the function actually returns 

■ disagreement between the types and number of arguments expected by 
functions and what the function receives 

■ inconsistencies that might prove to be bugs 

■ things that might cause portability problems 

Here is an example of a portability problem that would be caught by lint.
Code such as this,

int i = lseek(fdes, offset, whence);

would get by most compilers. However, lseek returns a long integer
representing the address of a location in the file. On a machine with a 16-bit
integer and a bigger long int, it would produce incorrect results, because i
would contain only the last 16 bits of the value returned.

Since it is reasonable to expect that an application written for a UNIX Sys- 
tem machine will be able to run on a variety of computers, it is important that 
the use of lint be a regular part of the application development. 
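
The loss of the high-order bits can be demonstrated on any machine with a short sketch. Here an unsigned short stands in for the 16-bit int; that it is exactly 16 bits wide is an assumption, though it holds on most machines. The function name offset_as_16bit is hypothetical.

```c
/*
 * Simulate storing a long file offset in a 16-bit integer.
 * unsigned short is assumed to be 16 bits wide; it plays the part
 * of the int on a 16-bit machine.
 */
long offset_as_16bit(long offset)
{
    unsigned short narrow = (unsigned short) offset;  /* high bits lost */

    return (long) narrow;
}
```

Under that assumption an offset of 1000 survives unchanged, but an offset of 70000 comes back as 4464 (70000 minus 65536), so a later seek would land in the wrong place without any complaint from the compiler.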

Where to Find More Information 

Chapter 16 of this guide contains a description of lint with examples of 
the kinds of conditions it uncovers. The manual page is in Section (1) of the 
Programmer's Reference Manual.
Volumes have been written on the subject of project control. It is an item
of top priority for the managers of any application development team. Two 
UNIX System tools that can play a role in this area are described in this sec- 
tion. 



make 

The make command is extremely useful in an application development 
project for keeping track of object files that need to be recompiled as changes 
are made to source code files. One of the characteristics of programs in a 
UNIX System environment is that they are made up of many small pieces, 
each in its own object file, that are link edited together to form the executable 
file. Quite a few of the UNIX System tools are devoted to supporting that 
style of program architecture. For example, archive libraries, shared libraries,
and the fact that the cc command accepts .o files as well as .c files and that it
can stop short of the ld step and produce .o files instead of an a.out, are all
important elements of modular architecture. The two main advantages of this
type of programming are that

■ A file that performs one function can be re-used in any program that 
needs it. 

■ When one function is changed, the whole program does not have to be 
recompiled. 

On the flip side, however, a consequence of the proliferation of object files 
is an increased difficulty in keeping track of what does need to be recompiled 
and what does not. make is designed to help deal with this problem. You 
use make by describing in a specification file, called makefile, the relationship 
(that is, the dependencies) between the different files of your program.
Having done that, you conclude a session in which you may have changed a
number of source code files by running the make command. make
takes care of generating a new a.out by comparing the time-last-changed of
your files against the dependency rules you have given it.
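
The time comparison at the heart of this can be sketched in a few lines of C. This is only an illustration of the idea, not the way make itself is written; the function names out_of_date and needs_rebuild are hypothetical.

```c
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: 1 if the target is older than its source. */
int out_of_date(time_t target_mtime, time_t source_mtime)
{
    return target_mtime < source_mtime;
}

/*
 * Return 1 if target must be rebuilt from source, 0 if it is up to
 * date, and -1 if the source itself cannot be examined.
 */
int needs_rebuild(const char *target, const char *source)
{
    struct stat t, s;

    if (stat(source, &s) == -1)
        return -1;              /* no source: nothing to build from */
    if (stat(target, &t) == -1)
        return 1;               /* no target yet: must build */
    return out_of_date(t.st_mtime, s.st_mtime);
}
```

A call such as needs_rebuild("oppty.o", "oppty.c") answers the question for one dependency; make applies the same test to every dependency named in the makefile.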

make has the ability to work with files in archive libraries or under con- 
trol of the Source Code Control System (SCCS). 
Where to Find More Information

The make(1) manual page is contained in the Programmer's Reference
Manual. Chapter 13 of this guide gives a complete description of how to use
make.



SCCS

SCCS is an abbreviation for Source Code Control System. It consists of a
set of 14 commands used to track evolving versions of files. Its use is not
limited to source code; any text files can be handled, so an application's
documentation can also be put under control of SCCS. SCCS can do the
following:

■ store and retrieve files under its control 

■ allow no more than a single copy of a file to be edited at one time 

■ provide an audit trail of changes to files 

■ reconstruct any earlier version of a file that may be wanted 

SCCS files are stored in a special coded format. Only through commands 
that are part of the SCCS package can files be made available in a user's 
directory for editing, compiling, etc. From the point at which a file is first 
placed under SCCS control, only changes to the original version are stored. 
For example, let us say that the program, restate, that was used in several 
examples in Chapter 2, was controlled by SCCS. 
One of the original pieces of that program is a file called oppty.c that looks
like the following:

/* Opportunity Cost -- oppty.c */
#include "recdef.h"

float
oppty(ps)
struct rec *ps;
{
        return(ps->i/12 * ps->t * ps->dp);
}



If you decide to add a message to this function, you might change the file 
like the following: 




/* Opportunity Cost -- oppty.c */

#include "recdef.h"
#include <stdio.h>

float
oppty(ps)
struct rec *ps;
{
        (void) fprintf(stderr, "Opportunity calling\n");
        return(ps->i/12 * ps->t * ps->dp);
}

SCCS saves only the two new lines from the second version, with a coded
notation that shows where in the text the two lines belong. It also includes a
note of the version number, lines deleted, lines inserted, total lines in the file,
the date and time of the change, and the login id of the person making the
change.

Where to Find More Information 

Chapter 14 of this guide is an SCCS user's guide. SCCS commands are in 
Section (1) of the Programmer's Reference Manual. 
liber, A Library System

To illustrate the use of UNIX System programming tools in the develop- 
ment of an application, we are going to pretend we are engaged in the 
development of a computer system for a library. The system is known as 
liber. The early stages of system development, we assume, have already been 
completed; feasibility studies have been done and the preliminary design is 
described in the coming paragraphs. We are going to stop short of producing 
a complete detailed design and module specifications for our system. You will 
have to accept that these exist. In using portions of the system for examples 
of the topics covered in this chapter, we will work from these virtual specifica- 
tions. 

We make no claim as to the efficacy of this design. It is the way it is, 
only in order to provide some passably realistic examples of UNIX System 
programming tools in use. 

liber is a system for keeping track of the books in a library. The 
hardware consists of a single computer with terminals throughout the library. 
One terminal is used for adding new books to the database. Others are used 
for checking out books and as electronic card catalogs. 

The design of the system calls for it to be brought up at the beginning of 
the day and remain running while the library is in operation. The system has 
one master index that contains the unique identifier of each title in the library. 
When the system is running, the index resides in memory. Semaphores are 
used to control access to the index. In the pages that follow, fragments of 
some of the system's programs are shown to illustrate the way they work 
together. The startup program performs the system initialization: opening the 
semaphores and shared memory, reading the index into the shared memory, 
and kicking off the other programs. The id numbers for the shared memory 
and semaphores (shmid, wrtsem, and rdsem) are read from a file during ini- 
tialization. The programs all share the in-memory index. They attach it with 
the following code:

/* attach shared memory for index */

if ((int) (index = (INDEX *) shmat(shmid, NULL, 0)) == -1)
{
        (void) fprintf(stderr, "shmat failed: %d\n", errno);
        exit(1);
}



Of the programs shown, add-books is the only one that alters the index. 
The semaphores are used to ensure that no other programs will try to read the 
index while add-books is altering it. The checkout program locks the file 
record for the book so that each copy being checked out is recorded 
separately, and the copy cannot be checked out at two different checkout sta- 
tions at the same time. 

The program fragments do not provide any details on the structure of the 
index or the book records in the database. 



/* liber.h - header file for the
 *           library system.
 */

typedef . . . INDEX;    /* data structure for book file index */
typedef struct {        /* type of records in book file */
        char title[30];
        char author[30];

} BOOK;
int shmid;
int wrtsem;
int rdsem;
INDEX *index;

int book_file;
BOOK book_buf;
/* startup program */
/*
 * 1. Open shared memory for file index and read it in.
 * 2. Open two semaphores for providing exclusive write access to index.
 * 3. Stash id's for shared memory segment and semaphores in a file
 *    where they can be accessed by the programs.
 * 4. Start programs: add-books, card-catalog, and checkout running
 *    on the various terminals throughout the library.
 */

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/sem.h>
#include "liber.h"

void exit();
extern int errno;

key_t key;
int shmid;
int wrtsem;
int rdsem;
FILE *ipc_file;

main()
{
        if ((shmid = shmget(key, sizeof(INDEX), IPC_CREAT | 0666)) == -1)
        {
                (void) fprintf(stderr, "startup: shmget failed: errno=%d\n", errno);
                exit(1);
        }

        if ((wrtsem = semget(key, 1, IPC_CREAT | 0666)) == -1)
        {
                (void) fprintf(stderr, "startup: semget failed: errno=%d\n", errno);
                exit(1);
        }

        if ((rdsem = semget(key, 1, IPC_CREAT | 0666)) == -1)
        {
                (void) fprintf(stderr, "startup: semget failed: errno=%d\n", errno);
                exit(1);
        }

        (void) fprintf(ipc_file, "%d\n%d\n%d\n", shmid, wrtsem, rdsem);

        /*
         * Start the add-books program running on the terminal in the
         * basement.  Start the checkout and card-catalog programs
         * running on the various other terminals throughout the library.
         */
}



/* card-catalog program */
/*
 * 1. Read screen for author and title.
 * 2. Use semaphores to prevent reading index while it is being written.
 * 3. Use index to get position of book record in book file.
 * 4. Print book record on screen or indicate book was not found.
 * 5. Go to 1.
 */

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <fcntl.h>
#include "liber.h"

void exit();
extern int errno;
struct sembuf sop[1];

main()
{
        while (1)
        {
                /*
                 * Read author/title/subject information from screen.
                 */

                /*
                 * Wait for write semaphore to reach 0 (index not being written).
                 */
                sop[0].sem_op = 0;
                if (semop(wrtsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Increment read semaphore so potential writer will wait
                 * for us to finish reading the index.
                 */
                sop[0].sem_op = 1;
                if (semop(rdsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /* Use index to find file pointer(s) for book(s) */

                /* Decrement read semaphore */
                sop[0].sem_op = -1;
                if (semop(rdsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Now we use the file pointers found in the index to
                 * read the book file.  Then we print the information
                 * on the book(s) to the screen.
                 */
        } /* while */
}

/* checkout program */
/*
 * 1. Read screen for Dewey Decimal number of book to be checked out.
 * 2. Use semaphores to prevent reading index while it is being written.
 * 3. Use index to get position of book record in book file.
 * 4. If book not found print message on screen, otherwise lock
 *    book record and read.
 * 5. If book already checked out, print message on screen, otherwise
 *    mark record "checked out" and write back to book file.
 * 6. Unlock book record.
 * 7. Go to 1.
 */

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <fcntl.h>
#include "liber.h"

void exit();
long lseek();
extern int errno;
struct flock flk;
struct sembuf sop[1];
long bookpos;

main()
{
        while (1)
        {
                /*
                 * Read Dewey Decimal number from screen.
                 */

                /*
                 * Wait for write semaphore to reach 0 (index not being written).
                 */
                sop[0].sem_flg = 0;
                sop[0].sem_op = 0;
                if (semop(wrtsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Increment read semaphore so that potential writer will
                 * wait for us to finish reading the index.
                 */
                sop[0].sem_op = 1;
                if (semop(rdsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Now we can use the index to find the book's record position.
                 * Assign this value to "bookpos".
                 */

                /* Decrement read semaphore */
                sop[0].sem_op = -1;
                if (semop(rdsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /* Lock the book's record in book file, read the record. */
                flk.l_type = F_WRLCK;
                flk.l_whence = 0;
                flk.l_start = bookpos;
                flk.l_len = sizeof(BOOK);
                if (fcntl(book_file, F_SETLKW, &flk) == -1)
                {
                        (void) fprintf(stderr, "trouble locking: %d\n", errno);
                        exit(1);
                }

                if (lseek(book_file, bookpos, 0) == -1)
                        Error processing for lseek;
                if (read(book_file, &book_buf, sizeof(BOOK)) == -1)
                        Error processing for read;

                /*
                 * If the book is checked out inform the client, otherwise
                 * mark the book's record as checked out and write it
                 * back into the book file.
                 */

                /* Unlock the book's record in book file. */
                flk.l_type = F_UNLCK;
                if (fcntl(book_file, F_SETLK, &flk) == -1)
                {
                        (void) fprintf(stderr, "trouble unlocking: %d\n", errno);
                        exit(1);
                }
        } /* while */
}

/* add-books program */
/*
 * 1. Read a new book entry from screen.
 * 2. Insert book in book file.
 * 3. Use semaphore "wrtsem" to block new readers.
 * 4. Wait for semaphore "rdsem" to reach 0.
 * 5. Insert book into index.
 * 6. Decrement wrtsem.
 * 7. Go to 1.
 */

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include "liber.h"

void exit();
extern int errno;
struct sembuf sop[1];
BOOK bookbuf;

main()
{
        for (;;)
        {
                /*
                 * Read information on new book from screen.
                 */
                addscr(&bookbuf);

                /* write new record at the end of the bookfile.
                 * Code not shown, but
                 * addscr() returns a 1 if title information has
                 * been entered, 0 if not.
                 */

                /*
                 * Increment write semaphore, blocking new readers from
                 * accessing the index.
                 */
                sop[0].sem_flg = 0;
                sop[0].sem_op = 1;
                if (semop(wrtsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Wait for read semaphore to reach 0 (all readers to finish
                 * using the index).
                 */
                sop[0].sem_op = 0;
                if (semop(rdsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }

                /*
                 * Now that we have exclusive access to the index we
                 * insert our new book with its file pointer.
                 */

                /* Decrement write semaphore, permitting readers to read index. */
                sop[0].sem_op = -1;
                if (semop(wrtsem, sop, 1) == -1)
                {
                        (void) fprintf(stderr, "semop failed: %d\n", errno);
                        exit(1);
                }
        } /* for */
}



The example following, addscr(), illustrates two significant points about
curses screens:

1. Information read in from a curses window can be stored in fields that
   are part of a structure defined in the header file for the application.

2. The address of the structure can be passed from another function
   where the record is processed.
/* addscr is called from add-books.
 * The user is prompted for title
 * information.
 */

#include <curses.h>

WINDOW *cmdwin;

addscr(bb)
struct BOOK *bb;
{
        int c;

        initscr();
        nonl();
        noecho();
        cbreak();

        cmdwin = newwin(6, 40, 3, 20);
        mvprintw(0, 0, "This screen is for adding titles to the database");
        mvprintw(1, 0, "Enter a to add; q to quit: ");
        refresh();
        for (;;)
        {
                refresh();
                c = getch();
                switch (c) {
                case 'a':
                        werase(cmdwin);
                        box(cmdwin, '|', '-');
                        mvwprintw(cmdwin, 1, 1, "Enter title: ");
                        wmove(cmdwin, 2, 1);
                        echo();
                        wrefresh(cmdwin);
                        wgetstr(cmdwin, bb->title);
                        noecho();
                        werase(cmdwin);
                        box(cmdwin, '|', '-');
                        mvwprintw(cmdwin, 1, 1, "Enter author: ");
                        wmove(cmdwin, 2, 1);
                        echo();
                        wrefresh(cmdwin);
                        wgetstr(cmdwin, bb->author);
                        noecho();
                        werase(cmdwin);
                        wrefresh(cmdwin);
                        endwin();
                        return(1);

                case 'q':
                        erase();
                        endwin();
                        return(0);
                }
        }
}



#  makefile for liber library system
#

CC = cc
CFLAGS = -O

all: startup add-books checkout card-catalog

startup: liber.h startup.c
	$(CC) $(CFLAGS) -o startup startup.c

add-books: add-books.o addscr.o
	$(CC) $(CFLAGS) -o add-books add-books.o addscr.o

add-books.o: liber.h

checkout: liber.h checkout.c
	$(CC) $(CFLAGS) -o checkout checkout.c

card-catalog: liber.h card-catalog.c
	$(CC) $(CFLAGS) -o card-catalog card-catalog.c
Introduction

NOTE

This chapter describes the new version of awk released in UNIX System V
Release 3.1 and described in nawk(1) in the User's/System Administrator's
Reference Manual. An earlier version is described in awk(1). The new
version will become the default in the next major UNIX System release.
Until then, you should read nawk for awk in this chapter.



Suppose you want to tabulate some survey results stored in a file, print 
various reports summarizing these results, generate form letters, reformat a 
data file for one application package to use with another package, or count the 
occurrences of a string in a file. awk is a programming language that makes it
easy to handle these and many other tasks of information retrieval and data 
processing. The name awk is an acronym constructed from the initials of its 
developers; it denotes the language and also the UNIX System command you 
use to run an awk program. 

awk is an easy language to learn. It automatically does quite a few things 
that you have to program for yourself in other languages. As a result, many 
useful awk programs are only one or two lines long. Because awk programs 
are usually smaller than equivalent programs in other languages, and because 
they are interpreted, not compiled, awk is also a good language for prototyp- 
ing. 

The first part of this chapter introduces you to the basics of awk and is 
intended to make it easy for you to start writing and running your own awk 
programs. The rest of the chapter describes the complete language and is 
somewhat less tutorial. For the experienced awk user, there's a summary of 
the language at the end of the chapter. 

You should be familiar with the UNIX System and shell programming to 
use this chapter. Although you don't need other programming experience, 
some knowledge of the C programming language is beneficial, because many 
constructs found in awk are also found in C. 
This section provides enough information for you to write and run some
of your own programs. Each topic presented is discussed in more detail in 
later sections. 



Program Structure 

The basic operation of awk(1) is to scan a set of input lines one after
another, searching for lines that match any of a set of patterns or conditions 
you specify. For each pattern, you can specify an action; this action is per- 
formed on each line that matches the pattern. Accordingly, an awk program 
is a sequence of pattern-action statements, as Figure 4-1 shows. 



Structure: 



pattern { action } 
pattern { action } 



Example: 

$1 == "address" { print $2, $3 } 



Figure 4-1: awk Program Structure and Example 



The example in the figure is a typical awk program, consisting of one
pattern-action statement. The program prints the second and third fields of 
each input line whose first field is address. In general, awk programs work 
by matching each line of input against each of the patterns in turn. For each 
pattern that matches, the associated action (which may involve multiple steps) 
is executed. Then the next line is read, and the matching starts over. This 
process typically continues until all the input has been read. 
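
To see how much awk does on your behalf, it may help to look at a C sketch of the same single rule. This is only an illustration of the matching loop, not how awk is implemented; the function name apply_rule and the limit MAXFIELDS are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#define MAXFIELDS 64    /* arbitrary limit chosen for this sketch */

/*
 * Apply the rule  $1 == "address" { print $2, $3 }  to one line.
 * Writes the output line into out and returns 1 on a match;
 * returns 0 and leaves out empty otherwise.
 * The line is modified in place by strtok.
 */
int apply_rule(char *line, char *out, size_t outsize)
{
    char *field[MAXFIELDS];
    int nf = 0;
    char *tok;

    out[0] = '\0';

    /* split the record into fields on blanks and tabs */
    for (tok = strtok(line, " \t\n"); tok != NULL && nf < MAXFIELDS;
         tok = strtok(NULL, " \t\n"))
        field[nf++] = tok;

    /* pattern: first field must be the string "address" */
    if (nf < 1 || strcmp(field[0], "address") != 0)
        return 0;

    /* action: print fields two and three; missing fields are empty */
    snprintf(out, outsize, "%s %s\n",
             nf > 1 ? field[1] : "", nf > 2 ? field[2] : "");
    return 1;
}
```

A complete program would still need a loop that reads each line with fgets, calls apply_rule, and writes out on a match. In awk the reading, splitting, and looping are all implicit, which is why the whole program fits on one line.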
Either the pattern or the action in a pattern-action statement may be
omitted. If there is no action with a pattern, as in

$1 == "name"

the matching line is printed. If there is no pattern with an action, as in

{ print $1, $2 }

the action is performed for every input line. Since patterns and actions are
both optional, actions are enclosed in braces to distinguish them from
patterns.

Usage 

There are two ways to run an awk program. First, you can type the com- 
mand line 

awk 'pattern-action statements' optional list of input files

to execute the pattern-action statements on the set of named input files. For
example, you could say

awk '{ print $1, $2 }' file1 file2

Notice that the pattern-action statements are enclosed in single quotes. This
protects characters like $ from being interpreted by the shell and also allows
the program to be longer than one line.

If no files are mentioned on the command line, awk(1) reads from the
standard input. You can also specify that input comes from the standard
input by using the hyphen ( - ) as one of the input files. For example,

awk '{ print $3, $4 }' file1 -

says to read input first from file1 and then from the standard input.

The arrangement above is convenient when the awk program is short (a
few lines). If the program is long, it is often more convenient to put it into a
separate file and use the -f option to fetch it:

awk -f program file optional list of input files

For example, the following command line says to fetch and execute
myprogram on input from the file file1:

awk -f myprogram file1
Fields

awk normally reads its input one line, or record, at a time; a record is, by
default, a sequence of characters ending with a newline character. awk then
splits each record into fields, where, by default, a field is a string of
non-blank, non-tab characters.

As input for many of the awk programs in this chapter, we use the file
countries, which contains information about the ten largest countries in the
world. Each record contains the name of a country, its area in thousands of
square miles, its population in millions, and the continent on which it is
found. (Data are from 1978; the U.S.S.R. has been arbitrarily placed in Asia.)
The white space between fields is a tab in the original input; a single blank
separates North and South from America.




USSR         8650    262    Asia
Canada       3852     24    North America
China        3692    866    Asia
USA          3615    219    North America
Brazil       3286    116    South America
Australia    2968     14    Australia
India        1269    637    Asia
Argentina    1072     26    South America
Sudan         968     19    Africa
Algeria       920     18    Africa




Figure 4-2: The Sample Input File countries 



This file is typical of the kind of data awk is good at processing — a mixture 
of words and numbers separated into fields by blanks and tabs. 

The number of fields in a record is determined by the field separator. 
Fields are normally separated by sequences of blanks and/or tabs, so the first 
record of countries would have four fields, the second five, and so on. It's 
possible to set the field separator to just tab, so each line would have four 
fields, matching the meaning of the data. We'll show how to do this shortly. 
For the time being, we'll use the default; fields separated by blanks and/or 
tabs. The first field within a line is called $1, the second $2, and so forth.
The entire record is called $0. 
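
The effect of the field separator on the number of fields can be illustrated with a short C sketch. The function count_fields is a hypothetical name, and the sketch simplifies matters by treating any run of separator characters as a single separator.

```c
#include <string.h>

/*
 * Count the fields in line, where any run of characters from seps
 * separates two fields.  The line is copied so the caller's string
 * is left untouched; lines longer than the buffer are truncated.
 */
int count_fields(const char *line, const char *seps)
{
    char buf[512];
    int n = 0;
    char *tok;

    strncpy(buf, line, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (tok = strtok(buf, seps); tok != NULL; tok = strtok(NULL, seps))
        n++;
    return n;
}
```

For the Canada record, with tabs between the fields and a blank inside North America, count_fields with the separators " \t" counts five fields, while with "\t" alone it counts four, matching the meaning of the data.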

Printing 

If the pattern in a pattern-action statement is omitted, the action is exe- 
cuted for all input lines. The simplest action is to print each line; you can 
accomplish this with an awk program consisting of a single print statement 

{ print }

so the command line 

awk '{ print }' countries 

prints each line of countries, copying the file to the standard output. The 
print statement can also be used to print parts of a record; for instance, the 
program 

{ print $1, $3 } 

prints the first and third fields of each record. Thus, 

awk '{ print $1, $3 }' countries 

produces as output the sequence of lines: 

USSR 262 
Canada 24 
China 866 
USA 219 
Brazil 116 
Australia 14 
India 637 
Argentina 26 
Sudan 19 
Algeria 18 

When printed, items separated by a comma in the print statement are 
separated by the output field separator, which, by default, is a single blank. 
Each line printed is terminated by the output record separator, which by 
default is a newline. 
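
Both separators can be changed. The sketch below is an illustration of ours rather than part of the original text: it sets the built-in variables OFS and ORS in a BEGIN action, assuming a POSIX shell with awk available and a two-record sample rebuilt with printf.

```shell
# Two sample records from countries, tab-separated.
data=$(printf 'USSR\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\n')

# OFS replaces the blank between comma-separated print items;
# ORS replaces the newline that ends each printed line.
out=$(printf '%s\n' "$data" |
      awk 'BEGIN { OFS = ":"; ORS = ";\n" } { print $1, $3 }')

printf '%s\n' "$out"
```

Each output line now reads name:population and ends with a semicolon before the newline.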



NOTE 



In the remainder of this chapter, we only show awk programs, without the 
command line that invokes them. Each complete program can be run, 
either by enclosing it in quotes as the first argument of the awk command, 
or by putting it in a file and invoking awk with the -f flag, as discussed in 
"awk Command Usage." In an example, if no input is mentioned, the 
input is assumed to be the file countries. 



Formatted Printing 

For more carefully formatted output, awk provides a C-like printf state- 
ment 

printf format, expr1, expr2, ..., exprn 

which prints each expr according to the specification in the string format. For 
example, the awk program 

{ printf "%10s %6d\n", $1, $3 } 

prints the first field ($1) as a string of 10 characters (right justified), then a 
space, then the third field ($3) as a decimal number in a six-character field, 
then a newline (\n). With input from the file countries, this program prints 
an aligned table: 
      USSR    262
    Canada     24
     China    866
       USA    219
    Brazil    116
 Australia     14
     India    637
 Argentina     26
     Sudan     19
   Algeria     18




With printf, no output separators or newlines are produced automatically; 
you must create them yourself by using \n in the format specification. "The 
printf Statement" in this chapter contains a full description of printf. 
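
To see what happens when the newline is left out of the format, consider this small sketch. It is our own illustration, assuming a POSIX shell with awk and a three-record sample built with printf.

```shell
# Three sample records from countries, tab-separated.
data=$(printf 'USSR\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\nChina\t3692\t866\tAsia\n')

# The format has no \n, so every name lands on the same output line;
# the END action supplies the one closing newline.
out=$(printf '%s\n' "$data" |
      awk '{ printf "%s,", $1 } END { printf "\n" }')

printf '%s\n' "$out"
```

All three names appear on a single line, each followed by a comma.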



Simple Patterns 

You can select specific records for printing or other processing by using 
simple patterns. awk has three kinds of patterns. First, you can use patterns 
called relational expressions that make comparisons. For example, the opera- 
tor == tests for equality. To print the lines for which the fourth field equals 
the string Asia, we can use the program consisting of the single pattern 

$4 == "Asia" 

With the file countries as input, this program yields 

USSR 8650 262 Asia 
China 3692 866 Asia 
India 1269 637 Asia 

The complete set of comparisons is >, >=, <, <=, == (equal to) and != 
(not equal to). These comparisons can be used to test both numbers and 
strings. For example, suppose we want to print only countries with a popula- 
tion greater than 100 million. The program 

$3 > 100 
is all that is needed. (Remember that the third field in the file countries is the 
population in millions.) It prints all lines in which the third field exceeds 100. 

Second, you can use patterns called regular expressions that search for 
specified characters to select records. The simplest form of a regular expres- 
sion is a string of characters enclosed in slashes: 

/US/ 

This program prints each line that contains the (adjacent) letters US anywhere; 
with the file countries as input, it prints 

USSR 8650 262 Asia 
USA 3615 219 North America 

We will have a lot more to say about regular expressions later in this chapter. 

Third, you can use two special patterns, BEGIN and END, that match 
before the first record has been read and after the last record has been pro- 
cessed. This program uses BEGIN to print a title: 

BEGIN  { print "Countries of Asia:" } 
/Asia/ { print "   ", $1 } 

The output is 

Countries of Asia: 
    USSR 
    China 
    India 



Simple Actions 

We have already seen the simplest action of an awk program: printing 
each input line. Now let's consider how you can use built-in and user-defined 
variables and functions for other simple actions in a program. 

Built-in Variables 

Besides reading the input and splitting it into fields, awk(1) counts the 
number of records read and the number of fields within the current record; 
you can use these counts in your awk programs. The variable NR is the 
number of the current record, and NF is the number of fields in the record. 
So the program 
{ print NR, NF } 

prints the number of each line and how many fields it has, while 

{ print NR, $0 } 
prints each record preceded by its record number. 

User-defined Variables 

Besides providing built-in variables like NF and NR, awk lets you define 
your own variables, which you can use for storing data, doing arithmetic, and 
the like. To illustrate, consider computing the total population and the aver- 
age population represented by the data in the file countries. 

    { sum = sum + $3 } 
END { print "Total population is", sum, "million" 
      print "Average population of", NR, "countries is", sum/NR } 

NOTE 

awk initializes sum to zero before it is used. 

The first action accumulates the population from the third field; the second 
action, which is executed after the last input, prints the sum and average: 

Total population is 2201 million 
Average population of 10 countries is 220.1 



Functions 

awk has built-in functions that handle common arithmetic and string 
operations for you. For example, there's an arithmetic function that computes 
square roots. There is also a string function that substitutes one string for 
another. awk also lets you define your own functions. Functions are 
described in detail in the section "Actions" in this chapter. 






A Handful of Useful One-liners 

Although awk can be used to write large programs of some complexity, 
many programs are not much more complicated than what we've seen so far. 
Here is a collection of other short programs that you may find useful and 
instructive. They are not explained here, but any new constructs do appear 
later in this chapter. 

Print last field of each input line: 

{ print $NF } 

Print 10th input line: 

NR == 10 

Print last input line: 

    { line = $0 } 
END { print line } 

Print input lines that don't have four fields: 

NF != 4 { print $0, "does not have 4 fields" } 

Print input lines with more than four fields: 

NF > 4 

Print input lines with last field more than 4: 

$NF > 4 

Print total number of input lines: 

END { print NR } 

Print total number of fields: 

    { nf = nf + NF } 
END { print nf } 

Print total number of input characters: 

    { nc = nc + length($0) } 
END { print nc + NR } 

(Adding NR includes in the total the number of newlines.) 

Print the total number of lines that contain the string Asia: 

/Asia/ { nlines++ } 
END    { print nlines } 

(The statement nlines++ has the same effect as nlines = nlines + 1.) 
Error Messages 

If you make an error in your awk program, you generally get an error 
message. For example, trying to run the program 

$3 < 200 { print ( $1 } 

generates the error messages 



awk: syntax error at source line 1 
 context is 
        $3 < 200 { print ( >>> $1 } <<< 
awk: illegal statement at source line 1 
        1 extra ( 



Some errors may be detected while your program is running. For example, if 
you try to divide a number by zero, awk stops processing and reports the 
input record number (NR) and the line number in the program. 
In a pattern-action statement, the pattern is an expression that selects the 
records for which the associated action is executed. This section describes the 
kinds of expressions that may be used as patterns. 

BEGIN and END 

BEGIN and END are two special patterns that give you a way to control 
initialization and wrap-up in an awk program. BEGIN matches before the 
first input record is read, so any statements in the action part of a BEGIN are 
done once, before the awk command starts to read its first input record. The 
pattern END matches the end of the input, after the last record has been pro- 
cessed. 

The following awk program uses BEGIN to set the field separator to tab 
(\t) and to put column headings on the output. The field separator is stored 
in a built-in variable called FS. Although FS can be reset at any time, usually 
the only sensible place is in a BEGIN section, before any input has been read. 
The program's second printf statement, which is executed for each input line, 
formats the output into a table, neatly aligned under the column headings. 
The END action prints the totals. (Notice that a long line can be continued 
after a comma.) 



BEGIN { FS = "\t" 
        printf "%10s %6s %5s   %s\n", 
               "COUNTRY", "AREA", "POP", "CONTINENT" } 
      { printf "%10s %6d %5d   %s\n", $1, $2, $3, $4 
        area = area + $2; pop = pop + $3 } 
END   { printf "\n%10s %6d %5d\n", "TOTAL", area, pop } 

With the file countries as input, this program produces 
   COUNTRY   AREA   POP   CONTINENT
      USSR   8650   262   Asia
    Canada   3852    24   North America
     China   3692   866   Asia
       USA   3615   219   North America
    Brazil   3286   116   South America
 Australia   2968    14   Australia
     India   1269   637   Asia
 Argentina   1072    26   South America
     Sudan    968    19   Africa
   Algeria    920    18   Africa

     TOTAL  30292  2201





Relational Expressions 

An awk pattern can be any expression involving comparisons between 
strings of characters or numbers. awk has six relational operators and two 
regular expression matching operators, ~ (tilde) and !~, which are discussed 
in the next section, for making comparisons. Figure 4-3 shows these operators 
and their meanings. 
Operator    Meaning

<           less than
<=          less than or equal to
==          equal to
!=          not equal to
>=          greater than or equal to
>           greater than
~           matches
!~          does not match



Figure 4-3: awk Comparison Operators 



In a comparison, if both operands are numeric, a numeric comparison is 
made; otherwise, the operands are compared as strings. (Every value might 
be either a number or a string; usually awk can tell what is intended. The 
section "Number or String?" contains more information about this.) Thus, 
the pattern $3>100 selects lines where the third field exceeds 100, and the 
program 

$1 >= "S" 

selects lines that begin with the letters S through Z, namely, 

USSR 8650 262 Asia 
USA 3615 219 North America 
Sudan 968 19 Africa 

In the absence of any other information, awk treats fields as strings, so 
the program 

$1 == $4 

compares the first and fourth fields as strings of characters, and with the file 
countries as input, prints the single line for which this test succeeds: 

Australia 2968 14 Australia 

If both fields appear to be numbers, the comparisons are done numerically. 
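
The difference matters when the same characters would order differently as strings and as numbers. The following sketch is ours, not from the original text; it assumes a POSIX shell with awk and two made-up input records.

```shell
# "10 9": both fields look numeric, so 10 > 9 is tested numerically.
# "abc abd": not numeric, so the fields are compared as strings.
out=$(printf '10 9\nabc abd\n' | awk '$1 > $2')

printf '%s\n' "$out"
```

Only the record 10 9 is selected. Compared as strings, 10 would sort before 9 and nothing would be printed.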
Regular Expressions 

awk provides more powerful patterns for searching for strings of charac- 
ters than the comparisons illustrated in the previous section. These patterns 
are called regular expressions, and are like those in egrep(1) in the 
User's/System Administrator's Reference Manual and lex(1) in the Programmer's 
Reference Manual. The simplest regular expression is a string of characters 
enclosed in slashes, like 

/Asia/ 

This program prints all input records that contain the substring Asia. (If a 
record contains Asia as part of a larger string like Asian or Pan-Asiatic, it is 
also printed.) In general, if re is a regular expression, then the pattern 

/re/ 

matches any line that contains a substring specified by the regular expression 
re. 

To restrict a match to a specific field, you use the matching operators ~ 
(matches) and !~ (does not match). The program 

$4 ~ /Asia/ { print $1 } 

prints the first field of all lines in which the fourth field matches Asia, while 
the program 

$4 !~ /Asia/ { print $1 } 

prints the first field of all lines in which the fourth field does not match Asia. 

In regular expressions, the symbols 

\ ^ $ . [ ] * + ? ( ) | 

are metacharacters with special meanings like the metacharacters in the UNIX 
System shell. For example, the metacharacters ^ and $ match the beginning 
and end, respectively, of a string, and the metacharacter . ("dot") matches 
any single character. Thus, 

/^.$/ 

matches all records that contain exactly one character. 
A group of characters enclosed in brackets matches any one of the 
enclosed characters; for example, /[ABC]/ matches records containing any 
one of A, B, or C anywhere. Ranges of letters or digits can be abbreviated 
within brackets: /[a-zA-Z]/ matches any single letter. 

If the first character after the left bracket ([) is a caret (^), this comple- 
ments the class so it matches any character not in the set: /[^a-zA-Z]/ 
matches any non-letter. The program 

$2 !~ /^[0-9]+$/ 

prints all records in which the second field is not a string of one or more digits 
(^ for beginning of string, [0-9]+ for one or more digits, and $ for end of 
string). Programs of this nature are often used for data validation. 

Parentheses () are used for grouping and the symbol | is used for alterna- 
tives. The program 

/(apple|cherry) (pie|tart)/ 

matches lines containing any one of the four substrings apple pie, apple 
tart, cherry pie, or cherry tart. 

To turn off the special meaning of a metacharacter, precede it by a \ 
(backslash). Thus, the program 

/b\$/ 

prints all lines containing b followed by a dollar sign. 

In addition to recognizing metacharacters, the awk command recognizes 
the following C programming language escape sequences within regular 
expressions and strings: 



\b       backspace
\f       formfeed
\n       newline
\r       carriage return
\t       tab
\ddd     octal value ddd
\"       quotation mark
\c       any other character c literally



For example, to print all lines containing a tab, use the program 
/\t/ 
awk interprets any string or variable on the right side of a ~ or !~ as a 
regular expression. For example, we could have written the program 

$2 !~ /^[0-9]+$/ 

as 

BEGIN { digits = "^[0-9]+$" } 
$2 !~ digits 



Suppose you wanted to search for a string of characters like ^[0-9]+$. 
When a literal quoted string like "^[0-9]+$" is used as a regular expression, 
one extra level of backslashes is needed to protect regular expression meta- 
characters. This is because one level of backslashes is removed when a string 
is originally parsed. If a backslash is needed in front of a character to turn off 
its special meaning in a regular expression, then that backslash needs a 
preceding backslash to protect it in a string. 

For example, suppose we want to match strings containing b followed by 
a dollar sign. The regular expression for this pattern is b\$. If we want to 
create a string to represent this regular expression, we must add one more 
backslash: "b\\$". The two regular expressions on each of the following 
lines are equivalent: 

x ~ "b\\$"      x ~ /b\$/
x ~ "b\$"       x ~ /b$/
x ~ "b$"        x ~ /b$/
x ~ "\\t"       x ~ /\t/
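
The extra level of backslashes can be tried out directly. This sketch is our own illustration, assuming a POSIX shell with awk; the awk programs are enclosed in single quotes so that the shell passes the backslashes through unchanged.

```shell
# Two input lines; only the first contains b followed by a literal $.
input=$(printf 'ab$c\nab\n')

# Regular expression written between slashes: one backslash.
re_out=$(printf '%s\n' "$input" | awk '$0 ~ /b\$/')

# The same expression written as a string: two backslashes.
str_out=$(printf '%s\n' "$input" | awk '$0 ~ "b\\$"')

printf 'slashes: %s\nstring:  %s\n' "$re_out" "$str_out"
```

Both forms select only the line ab$c.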



The precise form of regular expressions and the substrings they match is 
given in Figure 4-4. The unary operators *, +, and ? have the highest pre- 
cedence, with concatenation next, and then alternation |. All operators are 
left associative. r stands for any regular expression. 
Expression    Matches

c             any non-metacharacter c
\c            character c literally
^             beginning of string
$             end of string
.             any character but newline
[s]           any character in set s
[^s]          any character not in set s
r*            zero or more r's
r+            one or more r's
r?            zero or one r
(r)           r
r1r2          r1 then r2 (concatenation)
r1|r2         r1 or r2 (alternation)



Figure 4-4: awk Regular Expressions 



Combinations of Patterns 

A compound pattern combines simpler patterns with parentheses and the 
logical operators || (or), && (and), and ! (not). For example, suppose we 
want to print all countries in Asia with a population of more than 500 million. 
The following program does this by selecting all lines in which the fourth field 
is Asia and the third field exceeds 500: 

$4 == "Asia" && $3 > 500 

The program 

$4 == "Asia" || $4 == "Africa" 

selects lines with Asia or Africa as the fourth field. Another way to write 
the latter query is to use a regular expression with the alternation operator |: 

$4 ~ /^(Asia|Africa)$/ 
The negation operator ! has the highest precedence, then &&, and finally 
||. The operators && and || evaluate their operands from left to right; 
evaluation stops as soon as truth or falsehood is determined. 
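
Left-to-right evaluation lets an earlier test protect a later one. In the sketch below, which is ours and not from the original text, the record named Nowhere is a made-up entry with an area of zero; we assume a POSIX shell with awk.

```shell
# Name, area, population; "Nowhere" is a hypothetical record with area 0.
data=$(printf 'China\t3692\t866\nNowhere\t0\t5\nCanada\t3852\t24\n')

# The division is attempted only when $2 > 0 is true, so the
# zero-area record never causes a division by zero.
out=$(printf '%s\n' "$data" |
      awk 'BEGIN { FS = "\t" } $2 > 0 && 1000 * $3 / $2 > 100 { print $1 }')

printf '%s\n' "$out"
```

Only China, with a density above 100 people per square mile, is printed.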

Pattern Ranges 

A pattern range consists of two patterns separated by a comma, as in 

pat1, pat2 { ... } 

In this case, the action is performed for each line between an occurrence of 
pat1 and the next occurrence of pat2 (inclusive). As an example, the pattern 

/Canada/, /Brazil/ 

matches lines starting with the first line that contains the string Canada, up 
through the next occurrence of the string Brazil: 



Canada  3852    24      North America
China   3692    866     Asia
USA     3615    219     North America
Brazil  3286    116     South America



Similarly, since FNR is the number of the current record in the current input 
file (and FILENAME is the name of the current input file), the program 

FNR == 1, FNR == 5 { print FILENAME, $0 } 

prints the first five records of each input file with the name of the current 
input file prepended. 
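
A scaled-down version of this range can be run on two throwaway files. The sketch is our own; it assumes a POSIX shell with awk and the mktemp command, and the file names first and second are arbitrary.

```shell
# Create two three-line files in a scratch directory.
dir=$(mktemp -d)
printf 'a1\na2\na3\n' > "$dir/first"
printf 'b1\nb2\nb3\n' > "$dir/second"

# FNR restarts at 1 for each file, so the range opens again
# at the top of the second file.
out=$(cd "$dir" && awk 'FNR == 1, FNR == 2 { print FILENAME, $0 }' first second)

printf '%s\n' "$out"
rm -r "$dir"
```

The first two records of each file are printed, each preceded by its file name.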
In a pattern-action statement, the action determines what is to be done 
with the input records that the pattern selects. Actions frequently are simple 
printing or assignment statements, but they may also be a combination of one 
or more statements. This section describes the statements that can make up 
actions. 



Built-in Variables 

Figure 4-5 lists the built-in variables that awk maintains. Some of these we 
have already met; others are used in this and later sections. 




Variable    Meaning                                       Default

ARGC        number of command-line arguments              -
ARGV        array of command-line arguments               -
FILENAME    name of current input file                    -
FNR         record number in current file                 -
FS          input field separator                         blank&tab
NF          number of fields in current record            -
NR          number of records read so far                 -
OFMT        output format for numbers                     %.6g
OFS         output field separator                        blank
ORS         output record separator                       newline
RS          input record separator                        newline
RSTART      index of first character matched by match()   -
RLENGTH     length of string matched by match()           -
SUBSEP      subscript separator                           "\034"



Figure 4-5: awk Built-in Variables 



Arithmetic 

Actions can use conventional arithmetic expressions to compute numeric 
values. As a simple example, suppose we want to print the population den- 
sity for each country in the file countries. Since the second field is the area in 
thousands of square miles, and the third field is the population in millions, 
the expression 1000 * $3 / $2 gives the population density in people per 
square mile. The program 

{ printf "%10s %6.1f\n", $1, 1000 * $3 / $2 } 

applied to the file countries prints the name of each country and its popula- 
tion density: 

      USSR   30.3
    Canada    6.2
     China  234.6
       USA   60.6
    Brazil   35.3
 Australia    4.7
     India  502.0
 Argentina   24.3
     Sudan   19.6
   Algeria   19.6




Arithmetic is done internally in floating point. The arithmetic operators 
are +, -, *, /, % (remainder), and ^ (exponentiation; ** is a synonym). 
Arithmetic expressions can be created by applying these operators to constants, 
variables, field names, array elements, functions, and other expressions, all of 
which are discussed later. Note that awk recognizes and produces scientific 
(exponential) notation: 1e6, 1E6, 10e5, and 1000000 are numerically equal. 
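
The equality of these forms is easy to confirm. The following one-line check is our own sketch, assuming a POSIX shell with awk; the variables a and b are illustrative.

```shell
# Each comparison yields 1 when the two constants are numerically equal.
out=$(awk 'BEGIN { a = (1e6 == 1000000); b = (10e5 == 1E6); print a, b, 1e6 }')

printf '%s\n' "$out"
```

Both comparisons yield 1, and the constant 1e6 is printed as the integer 1000000.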

awk has assignment statements like those found in the C programming 
language. The simplest form is the assignment statement 

v = e 

where v is a variable or field name, and e is an expression. For example, to 
compute the number of Asian countries and their total population, we could 
write 

$4 == "Asia" { pop = pop + $3; n = n + 1 } 
END          { print "population of", n, 
               "Asian countries in millions is", pop } 
Applied to countries, this program produces 

population of 3 Asian countries in millions is 1765 

The action associated with the pattern $4 == "Asia" contains two assignment 
statements, one to accumulate population and the other to count countries. 
The variables are not explicitly initialized, yet everything works properly 
because awk initializes each variable with the string value "" and the 
numeric value 0. 

The assignments in the previous program can be written more concisely 
using the operators += and ++ as follows: 

$4 == "Asia" { pop += $3; ++n } 

The operator += is borrowed from the C programming language: 

pop += $3 
It has the same effect as 

pop = pop + $3 

but the += operator is shorter and runs faster. The same is true of the ++ 
operator, which adds one to a variable. 

The abbreviated assignment operators are +=, -=, *=, /=, %=, and ^=. 
Their meanings are similar: 

v op= e 

has the same effect as 

v = v op e. 

The increment operators are ++ and --. As in C, they may be used as pre- 
fix (++x) or postfix (x++) operators. If x is 1, then i=++x increments x, then 
sets i to 2, while i=x++ sets i to 1, then increments x. An analogous 
interpretation applies to prefix and postfix --. 

Assignment and increment and decrement operators may all be used in 
arithmetic expressions. 
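
The prefix and postfix forms can be compared side by side. This sketch is ours, assuming a POSIX shell with awk; x and i are the same illustrative names used in the text.

```shell
# Each line resets x to 1, applies one operator, and prints i and x.
out=$(awk 'BEGIN { x = 1; i = ++x; print i, x
                   x = 1; i = x++; print i, x
                   x = 1; i = --x; print i, x }')

printf '%s\n' "$out"
```

The three lines of output are 2 2, 1 2, and 0 0.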

We use default initialization to advantage in the following program, which 
finds the country with the largest population: 

maxpop < $3 { maxpop = $3; country = $1 } 
END         { print country, maxpop } 

Note, however, that this program would not be correct if all values of $3 were 
negative. 
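
One way around the problem, offered here as our own sketch rather than the guide's method, is to take the first record as the starting maximum instead of relying on the default value of zero. The input is a made-up file of names and negative numbers, and we assume a POSIX shell with awk.

```shell
# Hypothetical input in which every value is negative.
out=$(printf 'a -5\nb -2\nc -9\n' |
      awk 'NR == 1 || $2 + 0 > max { max = $2 + 0; name = $1 }
           END { print name, max }')

printf '%s\n' "$out"
```

Adding zero to $2 forces a numeric comparison; the program prints b -2, the largest of the three values.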

awk provides the built-in arithmetic functions shown in Figure 4-6. 



Function      Value Returned

atan2(y,x)    arctangent of y/x in the range -π to π
cos(x)        cosine of x, with x in radians
exp(x)        exponential function of x
int(x)        integer part of x truncated towards 0
log(x)        natural logarithm of x
rand()        random number between 0 and 1
sin(x)        sine of x, with x in radians
sqrt(x)       square root of x
srand(x)      x is new seed for rand()



Figure 4-6: awk Built-in Arithmetic Functions 



x and y are arbitrary expressions. The function rand() returns a pseudo- 
random floating point number in the range (0,1), and srand(x) can be used to 
set the seed of the generator. If srand() has no argument, the seed is derived 
from the time of day. 
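
A few of these functions can be exercised from a BEGIN action, which needs no input file. The sketch below is our own, assuming a POSIX shell with awk and the default output format for numbers.

```shell
# int() truncates towards zero; atan2(0, -1) is pi.
out=$(awk 'BEGIN { print int(3.7), int(-3.7), sqrt(16), exp(0), atan2(0, -1) }')

printf '%s\n' "$out"
```

The last value is printed with six significant digits, as 3.14159.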



Strings and String Functions 

A string constant is created by enclosing a sequence of characters inside 
quotation marks, as in "abc" or "hello, everyone". String constants may 
contain the C programming language escape sequences for special characters 
listed in "Regular Expressions" in this chapter. 

String expressions are created by concatenating constants, variables, field 
names, array elements, functions, and other expressions. The program 

{ print NR ":" $0 } 

prints each record preceded by its record number and a colon, with no blanks. 
The three strings representing the record number, the colon, and the record 
are concatenated, and the resulting string is printed. The concatenation opera- 
tor has no explicit representation other than juxtaposition. 
awk provides the built-in string functions shown in Figure 4-7. In this 
table, r represents a regular expression (either as a string or as /r/), s and t 
string expressions, and n and p integers. 



Function                   Description

gsub(r, s)                 substitutes s for r globally in current record,
                           returns number of substitutions
gsub(r, s, t)              substitutes s for r globally in string t,
                           returns number of substitutions
index(s, t)                returns position of string t in s, 0 if not present
length(s)                  returns length of s
match(s, r)                returns the position in s where r occurs, 0 if not
                           present
split(s, a)                splits s into array a on FS, returns number of fields
split(s, a, r)             splits s into array a on r, returns number of fields
sprintf(fmt, expr-list)    returns expr-list formatted according to format
                           string fmt
sub(r, s)                  substitutes s for first r in current record, returns
                           number of substitutions
sub(r, s, t)               substitutes s for first r in t, returns number of
                           substitutions
substr(s, p)               returns suffix of s starting at position p
substr(s, p, n)            returns substring of s of length n starting at
                           position p



Figure 4-7: awk Built-in String Functions 



The functions sub and gsub are patterned after the substitute command in 
the text editor ed(1), which can be found in the User's/System Administrator's 
Reference Manual. The function gsub(r, s, t) replaces successive occurrences of 
substrings matched by the regular expression r with the replacement string s 
in the target string t. (As in ed, the leftmost match is used, and is made as 
long as possible.) It returns the number of substitutions made. The function 
gsub(r, s) is a synonym for gsub(r, s, $0). For example, the program 

{ gsub(/USA/, "United States"); print } 

transcribes its input, replacing occurrences of USA by United States. The sub 
functions are similar, except that they only replace the first matching substring 
in the target string. 
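
The difference between the two functions, and the counts they return, can be seen on a single word. This sketch is ours, assuming a POSIX shell with awk; the variable names first and all are illustrative.

```shell
# Work on two copies of the record so each function starts from "banana".
out=$(echo 'banana' |
      awk '{ first = $0; all = $0
             n1 = sub(/an/, "AN", first)
             n2 = gsub(/an/, "AN", all)
             print n1, first
             print n2, all }')

printf '%s\n' "$out"
```

sub makes one substitution and gsub makes two.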
The function index(s, t) returns the leftmost position where the string t 
begins in s, or zero if t does not occur in s. The first character in a string is at 
position 1. For example, 

index("banana", "an") 

returns 2. 

The length function returns the number of characters in its argument 
string; thus, 

{ print length($0), $0 } 

prints each record, preceded by its length. ($0 does not include the input 
record separator.) The program 

length($1) > max { max = length($1); name = $1 } 
END { print name } 

applied to the file countries prints the longest country name: Australia. 

The match(s, r) function returns the position in string s where regular 
expression r occurs, or 0 if it does not occur. This function also sets two 
built-in variables RSTART and RLENGTH. RSTART is set to the starting 
position of the match in the string; this is the same value as the returned 
value. RLENGTH is set to the length of the matched string. (If a match does 
not occur, RSTART is 0, and RLENGTH is -1.) For example, the following 
program finds the first occurrence of the letter i, followed by at most one 
character, followed by the letter a in a record: 

{ if (match($0, /i.?a/)) 
      print RSTART, RLENGTH, $0 } 

It produces the following output on the file countries: 
17 2 USSR       8650    262     Asia
26 3 Canada     3852    24      North America
3 3 China       3692    866     Asia
24 3 USA        3615    219     North America
27 3 Brazil     3286    116     South America
8 2 Australia   2968    14      Australia
4 2 India       1269    637     Asia
7 3 Argentina   1072    26      South America
17 3 Sudan      968     19      Africa
6 2 Algeria     920     18      Africa



NOTE 



match() matches the left-most longest matching string. For example, with 
the record 

AsiaaaAsiaaaaan 

as input, the program 

{ if (match($0, /a+/)) print RSTART, RLENGTH, $0 } 

matches the first string of a's and sets RSTART to 4 and RLENGTH to 3. 

The function sprintf(format, expr1, expr2, ..., exprn) returns (without 
printing) a string containing expr1, expr2, ..., exprn formatted according to 
the printf specifications in the string format. "The printf Statement" in this 
chapter contains a complete specification of the format conventions. The 
statement 

x = sprintf("%10s %6d", $1, $2) 

assigns to x the string produced by formatting the values of $1 and $2 as a 
ten-character string and a decimal number in a field of width at least six; x 
may be used in any subsequent computation. 
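
To make the field widths visible, the following sketch of ours brackets the formatted string and prints its length. It assumes a POSIX shell with awk and uses the constants USSR and 8650 in place of $1 and $2.

```shell
# Ten characters for the name, a blank, six for the number: 17 in all.
out=$(awk 'BEGIN { x = sprintf("%10s %6d", "USSR", 8650)
                   print "[" x "]", length(x) }')

printf '%s\n' "$out"
```

The brackets show the padding that sprintf added on the left of each item.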
The function substr(s, p, n) returns the substring of s that begins at posi- 
tion p and is at most n characters long. If substr(s, p) is used, the substring 
goes to the end of s; that is, it consists of the suffix of s beginning at position 
p. For example, we could abbreviate the country names in countries to their 
first three characters by invoking the program 

{ $1 = substr($1, 1, 3); print } 

on this file to produce 



USS 8650 262 Asia 
Can 3852 24 North America 
Chi 3692 866 Asia 
USA 3615 219 North America 
Bra 3286 116 South America 
Aus 2968 14 Australia 
Ind 1269 637 Asia 
Arg 1072 26 South America 
Sud 968 19 Africa 
Alg 920 18 Africa 



Note that setting $1 in the program forces awk to recompute $0 and, there- 
fore, the fields are separated by blanks (the default value of OFS), not by tabs. 
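
The rebuilding of $0 shows up even when a field is assigned its own value. In this sketch, which is ours and assumes a POSIX shell with awk, the second command sets OFS to a vertical bar so that the new separators stand out.

```shell
record=$(printf 'USA\t3615\t219\tNorth America\n')

# Assigning to $1 rebuilds $0 with the default OFS, a single blank.
blank_out=$(printf '%s\n' "$record" |
            awk 'BEGIN { FS = "\t" } { $1 = $1; print }')

# The same assignment with OFS set to a vertical bar.
bar_out=$(printf '%s\n' "$record" |
          awk 'BEGIN { FS = "\t"; OFS = "|" } { $1 = $1; print }')

printf '%s\n%s\n' "$blank_out" "$bar_out"
```

In both cases the tabs of the input record are gone from the output.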

Strings are stuck together (concatenated) merely by writing them one after 
another in an expression. For example, when invoked on file countries, 

{ s = s substr($1, 1, 3) " " } 
END { print s } 

prints 

USS Can Chi USA Bra Aus Ind Arg Sud Alg 
by building s up, a piece at a time, from an initially empty string. 






Field Variables 

The fields of the current record can be referred to by the field variables $1, 
$2, . . . , $NF. Field variables share all of the properties of other variables: 
they may be used in arithmetic or string operations, and they may have 
values assigned to them. So, for example, you can divide the second field of 
the file countries by 1000 to convert the area from thousands to millions of 
square miles 

{ $2 /= 1000; print } 

or assign a new string to a field: 

BEGIN { FS = OFS = "\t" } 
$4 == "North America" { $4 = "NA" } 
$4 == "South America" { $4 = "SA" } 
{ print } 

The BEGIN action in this program resets the input field separator FS and the 
output field separator OFS to a tab. Notice that the print in the fourth line of 
the program prints the value of $0 after it has been modified by previous 
assignments. 

Fields can be accessed by expressions. For example, $(NF-1) is the second 
to last field of the current record. The parentheses are needed to show that 
the value of $NF-1 is 1 less than the value in the last field. 
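
The two expressions give different results on the same record. This one-line check is our own sketch, assuming a POSIX shell with awk and a made-up record of three numbers.

```shell
# $(NF-1) is the second field (20); $NF-1 is the last field minus one (29).
out=$(echo '10 20 30' | awk '{ print $(NF-1), $NF-1 }')

printf '%s\n' "$out"
```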

A field variable referring to a nonexistent field, for example, $(NF+1), has 
as its initial value the empty string. A new field can be created, however, by 
assigning a value to it. For example, the following program invoked on the 
file countries creates a fifth field giving the population density: 

BEGIN { FS = OFS = "\t" } 

{ $5 = 1000 * $3 / $2; print } 

The number of fields can vary from record to record, but there is usually 
an implementation limit of 100 fields per record. 
Number or String? 

Variables, fields and expressions can have both a numeric value and a 
string value. They take on numeric or string values according to context. For 
example, in the context of an arithmetic expression like 

pop += $3 

pop and $3 must be treated numerically, so their values will be coerced to 
numeric type if necessary. 

In a string context like 

print $1 $2 

$1 and $2 must be strings to be concatenated, so they will be coerced if neces- 
sary. 

In an assignment v = e or v op= e, the type of v becomes the type of e.
In an ambiguous context like 

$1 == $2 

the type of the comparison depends on whether the fields are numeric or 
string, and this can only be determined when the program runs; it may well 
differ from record to record. 

In comparisons, if both operands are numeric, the comparison is numeric; 
otherwise, operands are coerced to strings, and the comparison is made on the 
string values. All field variables are of type string; in addition, each field that 
contains only a number is also considered numeric. This determination is 
done at run time. For example, the comparison "$1 == $2" will succeed on 
any pair of the inputs 

1 1.0 +1 0.1e+1 10E-1 001 

but fail on the inputs

(null)    0
(null)    0.0
0a        0
1e50      1.0e50

There are two idioms for coercing an expression of one type to the other:

number ""     concatenate a null string to a number to coerce it
              to type string

string + 0    add zero to a string to coerce it to type numeric

Thus, to force a string comparison between two fields, say

$1 "" == $2

The numeric value of a string is the value of any prefix of the string that 
looks numeric; thus the value of 12.34x is 12.34, while the value of x12.34 is
zero. The string value of an arithmetic expression is computed by formatting 
the string with the output format conversion OFMT. 

Uninitialized variables have numeric value 0 and string value "". Nonexistent
fields and fields that are explicitly null have only the string value "";
they are not numeric.
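
The coercion rules can be checked directly. This is a small sketch, not part of the original examples; the input line and the variable names num and str are invented for the demonstration.

```shell
out=$(echo '1 1.0' | awk '{
    num = ($1 == $2)         # both fields look numeric: numeric comparison
    str = ($1 "" == $2)      # concatenating "" forces a string comparison
    print num, str
    print "12.34x" + 0, "x12.34" + 0    # numeric value is the numeric prefix
}')
echo "$out"
```

The first comparison succeeds because 1 and 1.0 are equal as numbers; the second fails because the strings differ.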



Control Flow Statements 

awk provides if-else, while, do-while, and for statements, and statement 
grouping with braces, as in the C programming language. 

The if statement syntax is 

if (expression) statement1 else statement2

The expression acting as the conditional has no restrictions; it can include the
relational operators <, <=, >, >=, ==, and !=; the regular expression matching
operators ~ and !~; the logical operators ||, &&, and !; juxtaposition for
concatenation; and parentheses for grouping.

In the if statement, the expression is first evaluated. If it is non-zero and
non-null, statement1 is executed; otherwise statement2 is executed. The else
part is optional.

A single statement can always be replaced by a statement list enclosed in 
braces. The statements in the statement list are terminated by newlines or 
semicolons. 

Rewriting the maximum population program from "Arithmetic Functions"
with an if statement results in

{ if (maxpop < $3) {
      maxpop = $3
      country = $1
  }
}
END { print country, maxpop }



The while statement is exactly that of the C programming language: 

while (expression) statement 

The expression is evaluated; if it is non-zero and non-null, the statement is exe- 
cuted, and the expression is tested again. The cycle repeats as long as the 
expression is non-zero. For example, to print all input fields one per line:

{ i = 1
  while (i <= NF) {
      print $i
      i++
  }
}



The for statement is like that of the C programming language: 

for (expression1; expression; expression2) statement

It has the same effect as

expression1
while (expression) {
    statement
    expression2
}

so

{ for (i = 1; i <= NF; i++) print $i }

does the same job as the while example above. An alternate version of the 
for statement is described in the next section. 

The do statement has the form 

do statement while (expression) 

The statement is executed repeatedly until the value of the expression becomes
zero. Because the test takes place after the execution of the statement (at the
bottom of the loop), the statement is always executed at least once. As a
result, the do statement is used much less often than while or for, which test
for completion at the top of the loop.

The following example of a do statement prints all lines except those
between start and stop.

/start/ {
    do {
        getline x
    } while (x !~ /stop/)
}
{ print }

The break statement causes an immediate exit from an enclosing while or
for; the continue statement causes the next iteration to begin. The next state- 
ment causes awk to skip immediately to the next record and begin matching 
patterns starting from the first pattern-action statement. 
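
The three statements can be seen working together in the following sketch. The input lines and the cutoff value 5 are arbitrary choices for illustration.

```shell
out=$(printf '1 2 -3 4\nskip me\n5 6\n' | awk '
/skip/ { next }                   # go straight to the next record
{
    sum = 0
    for (i = 1; i <= NF; i++) {
        if ($i < 0) continue      # ignore negative fields
        if ($i > 5) break         # stop at the first field larger than 5
        sum += $i
    }
    print sum
}')
echo "$out"
```

The first record sums 1, 2, and 4; the second is skipped by next before the summing action is reached; the third stops at 6.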

The exit statement causes the program to behave as if the end of the 
input had occurred; no more input is read, and the END action, if any, is exe- 
cuted. Within the END action, 

exit expr 

causes the program to return the value of expr as its exit status. If there is no 
expr, the exit status is zero. 
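
Both uses of exit can be tried from the shell. This is a sketch under the assumption that the shell reports the exit status in $?; the variable names count and status are ours.

```shell
# exit in a pattern-action statement stops reading input but still runs END.
count=$(printf '1\n2\n3\n' | awk '$1 == 2 { exit } { n++ } END { print n }')

# exit expr inside END sets the exit status of awk itself.
status=0
awk 'END { exit 3 }' </dev/null || status=$?

echo "$count $status"
```

Only the first record is counted, because the second triggers exit before the counting action runs.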

Arrays 

awk provides one-dimensional arrays. Arrays and array elements need 
not be declared; like variables, they spring into existence by being mentioned. 
An array subscript may be a number or a string. 

As an example of a conventional numeric subscript, the statement 

x[NR] = $0 

assigns the current input line to the NRth element of the array x. In fact, it is
possible in principle (though perhaps slow) to read the entire input into an 
array with the awk program 

{ x[NR] = $0 } 
END { . . . processing . . . } 

The first action merely records each input line in the array x, indexed by line 
number; processing is done in the END statement. 

Array elements may also be named by nonnumeric values. For example, 
the following program accumulates the total population of Asia and Africa 
into the associative array pop. The END action prints the total population of 
these two continents. 

/Asia/   { pop["Asia"] += $3 }
/Africa/ { pop["Africa"] += $3 }
END      { print "Asian population in millions is", pop["Asia"]
           print "African population in millions is",
                 pop["Africa"] }




On the file countries, this program generates 

Asian population in millions is 1765 
African population in millions is 37

In this program if we had used pop[Asia] instead of pop["Asia"], the
expression would have used the value of the variable Asia as the subscript, and
since the variable is uninitialized, the values would have been accumulated in
pop[""].

Suppose our task is to determine the total area in each continent of the 
file countries. Any expression can be used as a subscript in an array refer- 
ence. Thus, 

area[$4] += $2 

uses the string in the fourth field of the current input record to index the array 
area and in that entry accumulates the value of the second field: 

BEGIN { FS = "\t" } 

{ area[$4] += $2 } 
END { for (name in area) 

print name, area[name] } 

Invoked on the file countries, this program produces 

Africa 1888
North America 7467
South America 4358
Asia 13611
Australia 2968



This program uses a form of the for statement that iterates over all 
defined subscripts of an array: 

for (i in array) statement

executes statement with the variable i set in turn to each value of i for which 
array[i] has been defined. The loop is executed once for each defined sub- 
script, which are chosen in a random order. Results are unpredictable when i 
or array is altered during the loop. 

awk does not provide multi-dimensional arrays, but it does permit a list of
subscripts. They are combined into a single subscript with the values
separated by an unlikely string (stored in the variable SUBSEP). For example,

for (i = 1; i <= 10; i++)
    for (j = 1; j <= 10; j++)
        arr[i,j] = ...

creates an array which behaves like a two-dimensional array; the subscript is
the concatenation of i, SUBSEP, and j.

You can determine whether a particular subscript i occurs in an array arr
by testing the condition i in arr, as in

if ("Africa" in area) ...

This condition performs the test without the side effect of creating
area["Africa"], which would happen if we used

if (area["Africa"] != "") ...

Note that neither is a test of whether the array area contains an element with
value "Africa".
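
The following sketch illustrates both points: how a subscript list is stored, and how the in test differs from a direct reference. The helper names idx, found, and n are assumptions introduced for the demonstration.

```shell
out=$(awk 'BEGIN {
    arr[1,2] = "x"                        # stored under the subscript 1 SUBSEP 2
    for (k in arr) {
        split(k, idx, SUBSEP)             # recover the two parts of the subscript
        print idx[1], idx[2]
    }
    if ("Africa" in area) found = 1       # membership test only
    n = 0; for (k in area) n++
    print n                               # area is still empty
    if (area["Africa"] != "") found = 1   # the reference creates the element
    n = 0; for (k in area) n++
    print n                               # area now has one element
}')
echo "$out"
```

Counting the elements before and after shows that only the direct reference brings area["Africa"] into existence.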

It is also possible to split any string into fields in the elements of an array
using the built-in function split. The function

split("s1:s2:s3", a, ":")

splits the string s1:s2:s3 into three fields, using the separator :, and stores
s1 in a[1], s2 in a[2], and s3 in a[3]. The number of fields found, here
three, is returned as the value of split. The third argument of split is a
regular expression to be used as the field separator. If the third argument is
missing, FS is used as the field separator.
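
As a quick check of the return value and of the default separator, here is a minimal sketch. The second string and the array name b are made up for illustration.

```shell
out=$(awk 'BEGIN {
    n = split("s1:s2:s3", a, ":")
    print n, a[1], a[2], a[3]
    m = split("x  y z", b)        # no third argument: FS is used (blanks by default)
    print m, b[3]
}')
echo "$out"
```

With the default FS, the run of two blanks between x and y counts as a single separator.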

An array element may be deleted with the delete statement: 

delete arrayname[subscript] 



User-Defined Functions 

awk provides user-defined functions. A function is defined as 

function name (argument-list) { 
statements 

} 

The definition can occur anywhere a pattern-action statement can. The argu- 
ment list is a list of variable names separated by commas; within the body of 
the function, these variables refer to the actual parameters when the function 
is called. There must be no space between the function name and the left 
parenthesis of the argument list when the function is called; otherwise it looks 
like a concatenation. For example, the following program defines and tests 
the usual recursive factorial function (of course, using some input other than 
the file countries): 

function fact(n) {
    if (n <= 1)
        return 1
    else
        return n * fact(n-1)
}
{ print $1 "! is " fact($1) }





Array arguments are passed by reference, as in C, so it is possible for the
function to alter array elements or create new ones. Scalar arguments are 
passed by value, however, so the function cannot affect their values outside. 
Within a function, formal parameters are local variables, but all other variables 
are global. (You can have any number of extra formal parameters that are 
used purely as local variables.) The return statement is optional, but the 
returned value is undefined if it is not included. 
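
The parameter-passing rules can be demonstrated with a short program. This is a sketch; the function fill and the array sq are hypothetical names chosen for the example.

```shell
out=$(awk '
function fill(arr, n,    i) {     # i is an extra formal parameter, used as a local
    for (i = 1; i <= n; i++)
        arr[i] = i * i
    n = 0                         # changes only the local copy of n
}
BEGIN {
    i = "untouched"; n = 3
    sq[0] = 0                     # make sq an array before passing it
    fill(sq, n)
    print sq[1], sq[2], sq[3], n, i
}')
echo "$out"
```

The array elements set inside fill are visible afterward, while the global n and i keep their values. Separating the extra parameter from the real ones with additional blanks is only a convention for the reader.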



Some Lexical Conventions

Comments may be placed in awk programs: they begin with the character #
and end at the end of the line, as in

print x, y    # this is a comment

Statements in an awk program normally occupy a single line. Several
statements may occur on a single line if they are separated by semicolons. A
long statement may be continued over several lines by terminating each
continued line by a backslash. (It is not possible to continue a string.) This
explicit continuation is rarely necessary, however, since statements continue
automatically after the operators && and || or if the line ends with a comma
(for example, as might occur in a print or printf statement).

Several pattern-action statements may appear on a single line if separated
by semicolons.

Output

The print and printf statements are the two primary constructs that
generate output. The print statement is used to generate simple output; printf
is used for more carefully formatted output. Like the shell, awk lets you
redirect output so that output from print and printf can be directed to files
and pipes. This section describes the use of these two statements.



The print Statement 

The statement 

print expr1, expr2, ..., exprn

prints the string value of each expression separated by the output field
separator followed by the output record separator. The statement

print

is an abbreviation for

print $0

To print an empty line use

print ""

Output Separators 

The output field separator and record separator are held in the built-in 
variables OFS and ORS. Initially, OFS is set to a single blank and ORS to a 
single newline, but these values can be changed at any time. For example, the 
following program prints the first and second fields of each record with a 
colon between the fields and two newlines after the second field: 

BEGIN { OFS = ":"; ORS = "\n\n" }
{ print $1, $2 } 

Notice that 

{ print $1 $2 } 

prints the first and second fields with no intervening output field separator, 
because $1 $2 is a string consisting of the concatenation of the first two fields. 

The printf Statement

awk's printf statement is the same as that in C except that the * format
specifier is not supported. The printf statement has the general form

printf format, expr1, expr2, ..., exprn

where format is a string that contains both information to be printed and
specifications on what conversions are to be performed on the expressions in
the argument list, as in Figure 4-8. Each specification begins with a %, ends
with a letter that determines the conversion, and may include

-        left-justify expression in its field
width    pad field to this width as needed; fields that begin
         with a leading 0 are padded with zeros
.prec    maximum string width or digits to right of
         decimal point



Character    Prints Expression as

c            single character
d            decimal number
e            [-]d.ddddddE[+-]dd
f            [-]ddd.dddddd
g            e or f conversion, whichever is shorter, with
             nonsignificant zeros suppressed
o            unsigned octal number
s            string
x            unsigned hexadecimal number
%            print a %; no argument is converted



Figure 4-8: awk printf Conversion Characters 

Here are some examples of printf statements with the corresponding output:

printf "%d", 99/2                   49
printf "%e", 99/2                   4.950000e+01
printf "%f", 99/2                   49.500000
printf "%6.2f", 99/2                 49.50
printf "%g", 99/2                   49.5
printf "%o", 99                     143
printf "%06o", 99                   000143
printf "%x", 99                     63
printf "|%s|", "January"            |January|
printf "|%10s|", "January"          |   January|
printf "|%-10s|", "January"         |January   |
printf "|%.3s|", "January"          |Jan|
printf "|%10.3s|", "January"        |       Jan|
printf "|%-10.3s|", "January"       |Jan       |
printf "%%"                         %



The default output format of numbers is %.6g; this can be changed by assign- 
ing a new value to OFMT. OFMT also controls the conversion of numeric 
values to strings for concatenation and creation of array subscripts. 
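
Several of these conversions, and the effect of OFMT on print, can be tried in one short program. The following is a sketch; the value assigned to x is arbitrary.

```shell
out=$(awk 'BEGIN {
    printf "|%-10.3s|%06o|%6.2f|\n", "January", 99, 99/2
    x = 3.14159265
    print x                 # formatted with the default OFMT, %.6g
    OFMT = "%.2f"
    print x                 # formatted with the new OFMT
}')
echo "$out"
```

Note that printf adds no newline of its own, so the format string must supply one.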



Output into Files 

It is possible to print output into files instead of to the standard output, by
using the > and >> redirection operators. For example, the following program
invoked on the file countries prints all lines where the population (third
field) is bigger than 100 into a file called bigpop, and all other lines into
smallpop:

$3 > 100 { print $1, $3 >"bigpop" }
$3 <= 100 { print $1, $3 >"smallpop" }

Notice that the file names have to be quoted; without quotes, bigpop and
smallpop are merely uninitialized variables. If the output file names were
created by an expression, they would also have to be enclosed in parentheses:

$4 ~ /North America/ { print $1 > ("tmp" FILENAME) }

This is because the > operator has higher precedence than concatenation;
without parentheses, the concatenation of tmp and FILENAME would not work.



NOTE 



Files are opened once in an awk program. If > is used to open a file, its
original contents are overwritten. But if >> is used to open a file, its
contents are preserved and the output is appended to the file. Once the
file has been opened, the two operators have the same effect.
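
The behavior described in the note can be observed with a scratch file. This sketch assumes the mktemp command is available and that your awk accepts the -v option for presetting a variable, as POSIX awk does; the variable names are ours.

```shell
tmp=$(mktemp)                       # scratch file; the name is arbitrary
printf 'old\n' > "$tmp"

# > overwrites the file when it is first opened; the second print
# in the same run goes to the already open file.
printf 'a\nb\n' | awk -v out="$tmp" '{ print > out }'
first=$(cat "$tmp")

# >> preserves what is already in the file.
printf 'c\n' | awk -v out="$tmp" '{ print >> out }'
second=$(cat "$tmp")

rm -f "$tmp"
echo "$second"
```

After the first run the file holds a and b but not old; after the second run c has been appended.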



Output into Pipes 

It is also possible to direct printing into a pipe with a command on the 
other end, instead of into a file. The statement 

print | "command-line"

causes the output of print to be piped into the command-line. 

Although we have shown them here as literal strings enclosed in quotes,
the command line and file names can also come from variables or from the
return values of functions, for instance.

Suppose we want to create a list of continent-population pairs, sorted 
alphabetically by continent. The awk program below accumulates the popula- 
tion values in the third field for each of the distinct continent names in the 
fourth field in an array called pop. Then it prints each continent and its popu- 
lation, and pipes this output into the sort command. 

BEGIN { FS = "\t" }
      { pop[$4] += $3 }
END   { for (c in pop)
            print c ":" pop[c] | "sort" }

Invoked on the file countries, this program yields 

Africa:37
Asia:1765
Australia:14
North America:243
South America:142




In all of these print statements involving redirection of output, the files or
pipes are identified by their names (that is, the pipe above is literally named
sort), but they are created and opened only once in the entire run. So, in the
last example, for all c in pop, only one sort pipe is open.

There is a limit to the number of files that can be open simultaneously.
The statement close(file) closes a file or pipe; file is the string used to
create it in the first place, as in

close("sort")

When opening or closing a file, different strings are different commands.
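
Closing a pipe is also a way to make sure the command on the other end has finished before the program goes on. The following is a sketch of that use; the strings printed are arbitrary.

```shell
out=$(awk 'BEGIN {
    print "b" | "sort"
    print "a" | "sort"
    close("sort")        # sort sees end of input, writes its output, and exits
    print "done"
}')
echo "$out"
```

Because close is given exactly the string that opened the pipe, the sorted lines appear before the final line printed by awk itself.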

Input

The most common way to give input to an awk program is to name on
the command line the file(s) that contains the input. This is the method we've
been using in this chapter. However, there are several other methods we
could use, each of which this section describes.



Files and Pipes 

You can provide input to an awk program by putting the input data into a 
file, say awkdata, and then executing 

awk 'program' awkdata 

awk reads its standard input if no file names are given (see "Usage" in this
chapter); thus, a second common arrangement is to have another program
pipe its output into awk. For example, egrep(1), in the User's/System
Administrator's Reference Manual, selects input lines containing a specified
regular expression, but it can do so faster than awk, since this is the only
thing it does. We could, therefore, invoke the pipe

egrep 'Asia' countries | awk '...'

egrep quickly finds the lines containing Asia and passes them on to the awk 
program for subsequent processing. 

Input Separators 

With the default setting of the field separator FS, input fields are 
separated by blanks or tabs, and leading blanks are discarded, so each of these 
lines has the same first field: 

field1    field2
  field1
    field1

When the field separator is a tab, however, leading blanks are not discarded. 




The field separator can be set to any regular expression by assigning a 
value to the built-in variable FS. For example, 

BEGIN { FS = "(,[ \\t]*)|([ \\t]+)" }

sets it to a comma followed by any number of blanks and tabs, or to a run of
one or more blanks and tabs. FS can also be set on the command line with the
-F argument:

awk -F'(,[ \t]*)|([ \t]+)' '...'

behaves the same as the previous example. Regular expressions used as field
separators match the left-most longest occurrences (as in sub()), but do not
match null strings.
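
The three kinds of field separator can be compared side by side. This is a sketch with made-up input lines; the shell variable names are ours.

```shell
line=$(printf '  a\tb')     # two leading blanks, then a, a tab, and b

# Default FS: leading blanks are discarded.
d=$(printf '%s\n' "$line" | awk '{ print NF ":" $1 }')

# Tab as FS: leading blanks stay in the first field.
t=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\t" } { print NF ":" $1 }')

# Regular expression as FS: commas and runs of blanks both separate fields.
r=$(echo 'a, b  c,d' |
    awk 'BEGIN { FS = "(,[ \t]*)|([ \t]+)" } { print NF ":" $1 "-" $2 "-" $3 "-" $4 }')

printf '%s\n%s\n%s\n' "$d" "$t" "$r"
```

Both of the first two programs find two fields, but only the default separator strips the leading blanks from the first one.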



Multi-line Records 

Records are normally separated by newlines, so that each line is a record, 
but this too can be changed, though only in a limited way. If the built-in 
record separator variable RS is set to the empty string, as in 

BEGIN { RS = "" } 

then input records can be several lines long; a sequence of empty lines 
separates records. A common way to process multiple-line records is to use 

BEGIN { RS = ""; FS = "\n" }

to set the record separator to an empty line and the field separator to a new- 
line. There is a limit, however, on how long a record can be; it is usually 
about 2500 characters. "The getline Function" and "Cooperation with the 
Shell" in this chapter show other examples of processing multi-line records.
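
With these two settings, each line of a record becomes a field. The sketch below uses two short invented records separated by an empty line.

```shell
out=$(printf 'G. R. Emlin\n201-555-1234\n\nA. N. Other\n201-555-0000\n' |
      awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " (" $NF ")" }')
echo "$out"
```

NR counts records, not lines, so the two-line entries are numbered 1 and 2.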



The getline Function 

awk's facility for automatically breaking its input into records that are 
more than one line long is not adequate for some tasks. For example, if 
records are not separated by blank lines, but by something more complicated, 
merely setting RS to null doesn't work. In such cases, it is necessary to 
manage the splitting of each record into fields in the program. Here are some 
suggestions. 

The function getline can be used to read input either from the current
input or from a file or pipe, by using redirection in a manner analogous to
printf. By itself, getline fetches the next input record and performs the
normal field-splitting operations on it. It sets NF, NR, and FNR. getline
returns 1 if there was a record present, 0 if the end-of-file was encountered,
and -1 if some error occurred (such as failure to open a file).

To illustrate, suppose we have input data consisting of multi-line records, 
each of which begins with a line beginning with START and ends with a line 
beginning with STOP. The following awk program processes these multi-line 
records, a line at a time, putting the lines of the record into consecutive entries 
of an array 

f[1] f[2] ... f[nf] 

Once the line containing STOP is encountered, the record can be processed 
from the data in the f array: 



/^START/ {
    f[nf=1] = $0
    while (getline && $0 !~ /^STOP/)
        f[++nf] = $0
    # now process the data in f[1]...f[nf]
    ...
}

Notice that this code uses the fact that && evaluates its operands left to right
and stops as soon as one is false.

The same job can also be done by the following program:

/^START/ && nf == 0    { f[nf=1] = $0; next }
nf >= 1 && !/^STOP/    { f[++nf] = $0 }
/^STOP/                { # now process the data in f[1]...f[nf]
                         ...
                         nf = 0
                       }



The statement 
getline x 

reads the next record into the variable x. No splitting is done; NF is not set. 
The statement 

getline <"file" 

reads from file instead of the current input. It has no effect on NR or FNR, 
but field splitting is performed, and NF is set. The statement 
getline x <"file" 

gets the next record from file into x; no splitting is done, and NF, NR and 
FNR are untouched. 



NOTE 



If a filename is an expression, it should be in parentheses for evaluation:

while ( getline x < (ARGV[1] ARGV[2]) ) { ... }

This is because the < has precedence over concatenation. Without
parentheses, a statement such as

getline x < "tmp" FILENAME

sets x to read the file tmp and not tmp <value of FILENAME>. Also, if you
use this getline statement form, a statement like

while ( getline x < file ) { ... }

loops forever if the file cannot be read, because getline returns -1, not
zero, if an error occurs. A better way to write this test is

while ( (getline x < file) > 0 ) { ... }

It is also possible to pipe the output of another command directly into get- 
line. For example, the statement 

while ("who" | getline)
    n++

executes who and pipes its output into getline. Each iteration of the while
loop reads one more line and increments the variable n, so after the while
loop terminates, n contains a count of the number of users. Similarly, the
statement

"date" | getline d

pipes the output of date into the variable d, thus setting d to the current 
date. Figure 4-9 summarizes the getline function. 



Form                   Sets

getline                $0, NF, NR, FNR
getline var            var, NR, FNR
getline <file          $0, NF
getline var <file      var
cmd | getline          $0, NF
cmd | getline var      var



Figure 4-9: getline Function 
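
Some of the entries in the figure can be confirmed with small experiments. The following is a sketch; the input lines, the echo command, and the file name /nonexistent/awk-demo-file (assumed not to exist) are all made up for the purpose.

```shell
# getline var: sets var and NR, leaves $0 and NF alone.
a=$(printf 'a b\nc d e\n' | awk 'NR == 1 { getline x; print $0 "|" x "|" NF "|" NR }')

# Plain getline: replaces $0 and resets NF as well.
b=$(printf 'a b\nc d e\n' | awk 'NR == 1 { getline; print $0 "|" NF "|" NR }')

# cmd | getline var, and the > 0 test on a file that cannot be opened.
c=$(awk 'BEGIN {
    "echo hello" | getline greeting
    while ((getline line < "/nonexistent/awk-demo-file") > 0)
        n++
    print greeting, n + 0
}')

printf '%s\n%s\n%s\n' "$a" "$b" "$c"
```

In the last program the loop body never runs, because getline returns -1 for the unreadable file and the test against 0 fails at once.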



Command-line Arguments 

The command-line arguments are available to an awk program: the array 
ARGV contains the elements ARGV[0], . . . , ARGV[ARGC-1]; as in C, 
ARGC is the count. ARGV[0] is the name of the program (generally awk); 
the remaining arguments are whatever was provided (excluding the program
and any optional arguments). 

The following command line contains an awk program that echoes the
arguments that appear after the program name:

awk '
BEGIN {
    for (i = 1; i < ARGC; i++)
        printf "%s ", ARGV[i]
    printf "\n"
}' $*




The arguments may be modified or added to; ARGC may be altered. As each 
input file ends, awk treats the next non-null element of ARGV (up to the 
current value of ARGC-1) as the name of the next input file. 

There is one exception to the rule that an argument is a file name: if it is 
of the form 

var=value

then the variable var is set to the value value as if by assignment. Such an 
argument is not treated as a file name. If value is a string, no quotes are 
needed. 
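
Both kinds of argument can be tried from the shell. This sketch assumes the mktemp command; the variable name prefix and the sample data are invented for illustration.

```shell
# Ordinary arguments are available in ARGV[1] ... ARGV[ARGC-1].
args=$(awk 'BEGIN { for (i = 1; i < ARGC; i++) printf "%s ", ARGV[i]; printf "\n" }' one two three)

# prefix=item- is an assignment, not a file name; it takes effect
# before the data file that follows it is read.
data=$(mktemp)
printf '1\n2\n' > "$data"
tagged=$(awk '{ print prefix $0 }' prefix=item- "$data")
rm -f "$data"

printf '%s\n%s\n' "$args" "$tagged"
```

Because the first program consists only of a BEGIN action, the arguments are never opened as files.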

Using awk with Other Commands and the Shell

awk gains its greatest power when it is used in conjunction with other 
programs. Here we describe some of the ways in which awk programs 
cooperate with other commands. 

The system Function 

The built-in function system(command-line) executes the command
command-line, which may well be a string computed by, for example, the
built-in function sprintf. The value returned by system is the return status of
the command executed.

For example, the program

$1 == "#include" { gsub(/[<>"]/, "", $2); system("cat " $2) }

calls the command cat to print the file named in the second field of every
input record whose first field is #include, after stripping any <, >, or " that
might be present.
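
The return value of system can be used like any other number. Here is a minimal sketch; the path /nonexistent-awk-demo is assumed not to exist on your machine.

```shell
out=$(awk 'BEGIN {
    ok  = system("test -d /")                       # a directory that exists
    bad = system("test -d /nonexistent-awk-demo")   # a path assumed not to exist
    print ok, bad
}')
echo "$out"
```

A zero status means the command succeeded, following the usual shell convention.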



Cooperation with the Shell 

In all the examples thus far, the awk program was in a file and fetched 
from there using the -f flag, or it appeared on the command line enclosed in 
single quotes, as in 

awk '{ print $1 }' . . . 

Since awk uses many of the same characters as the shell does, such as $ and ",
surrounding the awk program with single quotes ensures that the shell will
pass the entire program unchanged to the awk interpreter.

Now, consider writing a command addr that will search a file addresslist 
for name, address, and telephone information. Suppose that addresslist con- 
tains names and addresses in which a typical entry is a multi-line record such 
as 

G. R. Emlin
600 Mountain Avenue
Murray Hill, NJ 07974
201-555-1234

Records are separated by a single blank line.

We want to search the address list by issuing commands like

addr Emlin

That is easily done by a program of the form

awk '
BEGIN { RS = "" }
/Emlin/
' addresslist

The problem is how to get a different search pattern into the program each 
time it is run. 

There are several ways to do this. One way is to create a file called addr 
that contains 

awk '
BEGIN { RS = "" }
/'$1'/
' addresslist

The quotes are critical here: the awk program is only one argument, even
though there are two sets of quotes, because quotes do not nest. The $1 is
outside the quotes, visible to the shell, which therefore replaces it by the
pattern Emlin when the command addr Emlin is invoked. On a UNIX System,
addr can be made executable by changing its mode with the following
command: chmod +x addr.

A second way to implement addr relies on the fact that the shell substi- 
tutes for $ parameters within double quotes: 

awk " 

BEGIN { RS = \"\" } 

/$1/ 

" addresslist 

Here we must protect the quotes defining RS with backslashes so that the
shell passes them on to awk, uninterpreted by the shell. $1 is recognized as
a parameter, however, so the shell replaces it by the pattern when the
command addr pattern is invoked.

A third way to implement addr is to use ARGV to pass the regular 
expression to an awk program that explicitly reads through the address list 
with getline: 



awk '
BEGIN { RS = ""
        while ((getline < "addresslist") > 0)
            if ($0 ~ ARGV[1])
                print $0
}' $*



All processing is done in the BEGIN action. 

Notice that any regular expression can be passed to addr; in particular, it 
is possible to retrieve by parts of an address or telephone number, as well as 
by name. 
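
The quoting in the first method can be tried without creating an addr script. In the sketch below, the shell variable pat plays the role of $1 and a scratch file stands in for addresslist; the second entry in the list is invented. The last command shows a further alternative that is not one of the three methods above: passing the pattern as an awk variable with the -v option (assumed available, as in POSIX awk) and matching it with ~.

```shell
list=$(mktemp)          # stands in for addresslist
printf 'G. R. Emlin\n600 Mountain Avenue\n\nA. N. Other\n1 Main Street\n' > "$list"
pat=Emlin               # plays the role of $1 inside the addr script

# The shell value is spliced into the program text between two quoted pieces.
spliced=$(awk 'BEGIN { RS = "" } /'"$pat"'/' "$list")

# The value is passed as an awk variable and used as a dynamic regular expression.
passed=$(awk -v pat="$pat" 'BEGIN { RS = "" } $0 ~ pat' "$list")

rm -f "$list"
echo "$spliced"
```

Both commands select the same record. The second keeps the awk program text fixed, so characters in the pattern cannot be mistaken for part of the program.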

Example Applications

awk has been used in surprising ways. We have seen awk programs that
implement database systems and a variety of compilers and assemblers, in
addition to the more traditional tasks of information retrieval, data
manipulation, and report generation. Invariably, the awk programs are
significantly shorter than equivalent programs written in more conventional
programming languages, such as Pascal or C. In this section, we will present
a few more examples to illustrate some additional awk programs.

Generating Reports

awk is especially useful for producing reports that summarize and format
information. Suppose we wish to produce a report from the file countries in
which we list the continents alphabetically, and after each continent its
countries in decreasing order of population:





Africa:
        Sudan          19
        Algeria        18

Asia:
        China         866
        India         637
        USSR          262

Australia:
        Australia      14

North America:
        USA           219
        Canada         24

South America:
        Brazil        116
        Argentina      26

As with many data processing tasks, it is much easier to produce this
report in several stages. First we create a list of continent-country-population
triples, in which each field is separated by a colon. This can be done with the
following program, triples, which uses an array pop, indexed by subscripts of
the form 'continent:country' to store the population of a given country. The
print statement in the END section of the program creates the list of
continent-country-population triples that are piped to the sort routine.

BEGIN { FS = "\t" } 

{ pop[$4 ":" $1] += $3 } 
END { for (cc in pop) 

            print cc ":" pop[cc] | "sort -t: +0 -1 +2nr" }

The arguments for sort deserve special mention. The -t: argument tells
sort to use : as its field separator. The +0 -1 arguments make the first field
the primary sort key. In general, +i -j makes fields i+1, i+2, ..., j the sort
key. If -j is omitted, the fields from i+1 to the end of the record are used.
The +2nr argument makes the third field, numerically decreasing, the
secondary sort key (n is for numeric, r for reverse order). Invoked on the file
countries, this program produces as output:



Africa:Sudan:19
Africa:Algeria:18
Asia:China:866
Asia:India:637
Asia:USSR:262
Australia:Australia:14
North America:USA:219
North America:Canada:24
South America:Brazil:116
South America:Argentina:26



This output is in the right order but the wrong format. To transform the 
output into the desired form we run it through a second awk program, for- 
mat. 

BEGIN { FS = ":" }
{ if ($1 != prev) {
      print "\n" $1 ":"
      prev = $1
  }
  printf "\t%-10s %6d\n", $2, $3
}

This is a control-break program that prints only the first occurrence of a con- 
tinent name and formats the country-population lines associated with that 
continent in the desired manner. The command line 

awk -f triples countries | awk -f format

gives us our desired report. As this example suggests, complex data transfor- 
mation and formatting tasks can often be reduced to a few simple awks and 
sorts. 
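
The two stages can be run on a few lines of data to see how they fit together. The sketch below makes some assumptions of its own: it uses a four-line extract in the format of countries, it moves the sort out of the awk program into the shell pipeline, and it writes the sort keys in the -k form, since many current sort commands no longer accept the +0 -1 +2nr notation.

```shell
data=$(mktemp)          # a four-line extract in the format of countries
printf 'USSR\t8650\t262\tAsia\nCanada\t3852\t24\tNorth America\n' >  "$data"
printf 'China\t3692\t866\tAsia\nUSA\t3615\t219\tNorth America\n'   >> "$data"

# Stage 1: continent:country:population triples, sorted by continent
# and then by decreasing population.
triples=$(awk 'BEGIN { FS = "\t" }
                     { pop[$4 ":" $1] += $3 }
               END   { for (cc in pop) print cc ":" pop[cc] }' "$data" |
          sort -t: -k1,1 -k3,3nr)

# Stage 2: print each continent once, followed by its countries.
report=$(printf '%s\n' "$triples" | awk 'BEGIN { FS = ":" }
    { if ($1 != prev) { print "\n" $1 ":"; prev = $1 }
      printf "\t%-10s %6d\n", $2, $3 }')

rm -f "$data"
printf '%s\n' "$report"
```

Here -k1,1 corresponds to +0 -1, and -k3,3nr to +2nr restricted to the third field.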

As an exercise, add to the population report subtotals for each continent 
and a grand total. 



Additional Examples 

Word Frequencies 

Our first example illustrates associative arrays for counting. Suppose we 
want to count the number of times each word appears in the input, where a 
word equals any contiguous sequence of non-blank, non-tab characters. The 
following program prints the word frequencies, sorted in decreasing order. 

    { for (w = 1; w <= NF; w++) count[$w]++ }
END { for (w in count) print count[w], w | "sort -nr" }

The first statement uses the array count to accumulate the number of times 
each word is used. Once the input has been read, the second for loop pipes 
the final count, along with each word, into the sort command. 
Accumulation

Suppose we have two files, deposits and withdrawals, of records
containing a name field and an amount field. For each name we want to print
the net balance determined by subtracting the total withdrawals from the total
deposits. The net balance can be computed by the following program:




awk '
FILENAME == "deposits"    { balance[$1] += $2 }
FILENAME == "withdrawals" { balance[$1] -= $2 }
END { for (name in balance)
          print name, balance[name]
} ' deposits withdrawals




The first statement uses the array balance to accumulate the total amount for 
each name in the file deposits. The second statement subtracts associated 
withdrawals from each total. If there are only withdrawals associated with a 
name, an entry for that name will be created by the second statement. The 
END action prints each name with its net balance. 

Random Choice 

The following function prints (in order) k random elements from the first
n elements of the array A. In the program, k is the number of entries that
still need to be printed, and n is the number of elements yet to be examined.
The decision of whether to print the ith element is determined by the test
rand() < k/n.






function choose(A, k, n) {
    for (i = 1; n > 0; i++)
        if (rand() < k/n--) {
            print A[i]
            k--
        }
}



Shell Facility 

The following awk program simulates (crudely) the history facility of the 
UNIX System shell. A line containing only = re-executes the last command 
executed. A line beginning with = cmd re-executes the last command whose 
invocation included the string cmd. Otherwise, the current line is executed. 



$1 == "=" { if (NF == 1)
                system(x[NR] = x[NR-1])
            else
                for (i = NR-1; i > 0; i--)
                    if (x[i] ~ $2) {
                        system(x[NR] = x[i])
                        break
                    }
            next }

/./ { system(x[NR] = $0) }
Form-letter Generation

The following program generates form letters, using a template stored in a
file called form.letter:

This is a form letter.
The first field is $1, the second $2, the third $3.
The third is $3, second is $2, and first is $1.

and replacement text of this form:

field 1|field 2|field 3
one|two|three
a|b|c

The BEGIN action stores the template in the array line; the remaining
action cycles through the input data, using gsub to replace template fields of
the form $n with the corresponding data fields.




BEGIN { FS = "|"
        while ((getline < "form.letter") > 0)   # stop at end of file or on error
            line[++n] = $0
}

{ for (i = 1; i <= n; i++) {
        s = line[i]
        for (j = 1; j <= NF; j++)
            gsub("\\$"j, $j, s)
        print s
  }
}




In all such examples, a prudent strategy is to start with a small version 
and expand it, trying out each aspect before moving on to the next. 
Command Line

awk program filenames 
awk -f program-file filenames 

awk -Fs sets field separator to string s; -Ft sets separator to tab 



Patterns

BEGIN
END

/regular expression/
relational expression
pattern && pattern
pattern || pattern
(pattern)
!pattern

pattern, pattern



Control Flow Statements 

if (expr) statement [else statement] 

if (subscript in array) statement [else statement] 

while (expr) statement 

for (expr; expr; expr) statement 

for (var in array) statement 

do statement while (expr) 

break 

continue 

next 

exit [expr] 
return [expr] 
Input-Output

close(filename)               close file
getline                       set $0 from next input record; set NF, NR, FNR
getline <file                 set $0 from next record of file; set NF
getline var                   set var from next input record; set NR, FNR
getline var <file             set var from next record of file
print                         print current record
print expr-list               print expressions
print expr-list >file         print expressions on file
printf fmt, expr-list         format and print
printf fmt, expr-list >file   format and print on file
system(cmd-line)              execute command cmd-line, return status

In print and printf above, >>file appends to the file, and | command
writes on a pipe. Similarly, command | getline pipes into getline. getline
returns 0 on end of file, and -1 on error.



Functions 

func name(parameter list) { statement }
function name(parameter list) { statement }
function-name(expr, expr, ...)
String Functions

gsub(r, s, t)             substitute string s for each substring matching
                          regular expression r in string t, return number
                          of substitutions; if t omitted, use $0
index(s, t)               return index of string t in string s, or 0 if not
                          present
length(s)                 return length of string s
match(s, r)               return position in s where regular expression r
                          occurs, or 0 if r is not present
split(s, a, r)            split string s into array a on regular expression
                          r, return number of fields; if r omitted, FS is
                          used in its place
sprintf(fmt, expr-list)   print expr-list according to fmt, return resulting
                          string
sub(r, s, t)              like gsub except only the first matching
                          substring is replaced
substr(s, i, n)           return n-char substring of s starting at i; if n
                          omitted, use rest of s



Arithmetic Functions 



atan2(y, x)    arctangent of y/x in radians
cos(expr)      cosine (angle in radians)
exp(expr)      exponential
int(expr)      truncate to integer
log(expr)      natural logarithm
rand()         random number between 0 and 1
sin(expr)      sine (angle in radians)
sqrt(expr)     square root
srand(expr)    new seed for random number generator;
               use time of day if no expr
Operators (Increasing Precedence)

= += -= *= /= %= ^=    assignment
?:                     conditional expression
||                     logical OR
&&                     logical AND
~ !~                   regular expression match, negated match
< <= > >= != ==        relational
blank                  string concatenation
+ -                    add, subtract
* / %                  multiply, divide, mod
+ - !                  unary plus, unary minus, logical negation
^                      exponentiation (** is a synonym)
++ --                  increment, decrement (prefix and postfix)
$                      field



Regular Expressions (Increasing Precedence) 



c            matches non-metacharacter c
\c           matches literal character c
.            matches any character but newline
^            matches beginning of line or string
$            matches end of line or string
[abc...]     character class matches any of abc...
[^abc...]    negated class matches any but abc... and newline
r1|r2        matches either r1 or r2
r1r2         concatenation: matches r1, then r2
r+           matches one or more r's
r*           matches zero or more r's
r?           matches zero or one r's
(r)          grouping: matches r
Built-in Variables

ARGC       number of command-line arguments
ARGV       array of command-line arguments (0..ARGC-1)
FILENAME   name of current input file
FNR        input record number in current file
FS         input field separator (default blank)
NF         number of fields in current input record
NR         input record number since beginning
OFMT       output format for numbers (default %.6g)
OFS        output field separator (default blank)
ORS        output record separator (default newline)
RS         input record separator (default newline)
RSTART     index of first character matched by match(); 0 if no match
RLENGTH    length of string matched by match(); -1 if no match
SUBSEP     separates multiple subscripts in array elements; default "\034"



Limits 

Any particular implementation of awk enforces some limits. Here are typ- 
ical values: 

100 fields 

2500 characters per input record 
2500 characters per output record 
1024 characters per individual field 
1024 characters per printf string 
400 characters maximum quoted string 
400 characters in character class 
15 open files 
1 pipe 

numbers are limited to what can be represented on the local
machine, e.g., 1e-38..1e+38
Initialization, Comparison, and Type Coercion

Each variable and field can potentially be a string or a number or both at
any time. When a variable is set by the assignment

var = expr

its type is set to that of the expression. (Assignment includes +=, -=, etc.)
An arithmetic expression is of type number, a concatenation is of type string,
and so on. If the assignment is a simple copy, as in

v1 = v2 

then the type of v1 becomes that of v2. 

In comparisons, if both operands are numeric, the comparison is made 
numerically. Otherwise, operands are coerced to string if necessary, and the 
comparison is made on strings. The type of any expression can be coerced to 
numeric by subterfuges such as 

expr + 0

and to string by

expr ""

(that is, concatenation with a null string).

Uninitialized variables have the numeric value 0 and the string value "".
Accordingly, if x is uninitialized,

if (x) ...

is false, and

if (!x) ...

if (x == 0) ...

if (x == "") ...

are all true. But the following is false:

if (x == "0") ...

The type of a field is determined by context when possible; for example,

$1++

clearly implies that $1 is to be numeric, and

$1 = $1 "," $2

implies that $1 and $2 are both to be strings. Coercion is done as needed.
In contexts where types cannot be reliably determined, for example,

if ($1 == $2) ...

the type of each field is determined on input. All fields are strings; in
addition, each field that contains only a number is also considered numeric.

Fields that are explicitly null have the string value ""; they are not
numeric. Non-existent fields (i.e., fields past NF) are treated this way, too.

As it is for fields, so it is for array elements created by split().

Mentioning a variable in an expression causes it to exist, with the value
"" as described above. Thus, if arr[i] does not currently exist,

if (arr[i] == "") ...

causes it to exist with the value "" so the if is satisfied. The special
construction

if (i in arr) ...

determines if arr[i] exists without the side effect of creating it if it does not.

An Overview of lex Programming



The software tool lex lets you solve a wide class of problems drawn from 
text processing, code enciphering, compiler writing, and other areas. In text 
processing, you may check the spelling of words for errors; in code encipher- 
ing, you may translate certain patterns of characters into others; and in com- 
piler writing, you may determine what the tokens (smallest meaningful 
sequences of characters) are in the program to be compiled. The problem 
common to all of these tasks is recognizing different strings of characters that 
satisfy certain characteristics. In the compiler writing case, creating the ability 
to solve the problem requires implementing the compiler's lexical analyzer; 
hence the name lex. 

It is not essential to use lex to handle problems of this kind. You could
write programs in a standard language like C to handle them, too. In fact,
what lex does is produce such C programs. (lex is therefore called a program
generator.) What lex offers you, once you acquire a facility with it, is typically
a faster, easier way to create programs that perform these tasks. Its weakness
is that it often produces C programs that are longer than necessary for the
task at hand and that execute more slowly than they otherwise might. In
many applications this is a minor consideration, and the advantages of using
lex considerably outweigh it.

To understand what lex does, see the diagram in Figure 5-1. We begin
with the lex source (often called the lex specification) that you, the
programmer, write to solve the problem at hand. This lex source consists of a list of
rules specifying sequences of characters (expressions) to be searched for in an
input text, and the actions to take when an expression is found. The source is
read by the lex program generator. The output of the program generator is a
C program that, in turn, must be compiled by a host language C compiler in
order to generate the executable object program that does the lexical analysis.
Note that this procedure is not typically automatic; user intervention is
required. Finally, the lexical analyzer program produced by this process takes
as input any source file and produces the desired output, such as altered text
or a list of tokens.

lex can also be used to collect statistical data on features of the input, such
as character count, word length, number of occurrences of a word, and so
forth. In later sections of this chapter, we will see

■ how to write lex source to do some of these tasks 

■ how to translate lex source 

■ how to compile, link, and execute the lexical analyzer in C 

■ how to run the lexical analyzer program 

We will then be on our way to appreciating the power that lex provides. 



Figure 5-1: Creation and Use of a Lexical Analyzer with lex
(Diagram: lex source -> lex -> lex analyzer in C -> C compiler.)

A lex specification consists of at most three sections: definitions, rules,
and user subroutines. The rules section is mandatory. Sections for definitions
and user subroutines are optional, but if present, must appear in the indicated
order.



The Fundamentals of lex Rules

The mandatory rules section opens with the delimiter %%. If a subrou- 
tines section follows, another %% delimiter ends the rules section. If there is 
no second delimiter, the rules section is presumed to continue to the end of 
the program. 

Each rule consists of a specification of the pattern sought and the action(s) 
to take on finding it. (Note the dual meaning of the term specification — it 
may mean either the entire lex source itself or, within it, a representation of a 
particular pattern to be recognized.) Whenever the input consists of patterns 
not sought, lex writes out the input exactly as it finds it. So, the simplest lex 
program is just the beginning rules delimiter, %%. It writes out the entire 
input to the output with no changes at all. Typically, the rules are more ela- 
borate than that. 

Specifications 

You specify the patterns you are interested in with a notation called regu- 
lar expressions. A regular expression is formed by stringing together charac- 
ters with or without operators. The simplest regular expressions are strings of 
text characters with no operators at all. For example, 

apple 

orange 

pluto 

These three regular expressions match any occurrences of those character
strings in an input text. If you want to have your lexical analyzer, a.out,
remove every occurrence of orange from the input text, you could specify the
rule

orange ;

Because you did not specify an action on the right (before the semicolon),
lex does nothing but print out the original input text with every occurrence of
this regular expression removed, that is, without any occurrence of the string
orange at all.

Unlike orange above, most of the expressions that we want to search for
cannot be specified so easily. The expression itself might simply be too long.
More commonly, the class of desired expressions is too large; it may, in fact,
be infinite. Thanks to the use of operators, we can form regular expressions
signifying any expression of a certain class. The + operator, for instance,
means one or more occurrences of the preceding expression, the ? means 0 or
1 occurrence of the preceding expression (this is equivalent, of course, to
saying that the preceding expression is optional), and * means 0 or more
occurrences of the preceding expression. (It may at first seem odd to speak of
0 occurrences of an expression and to need an operator to capture the idea,
but it is often quite helpful. We will see an example in a moment.)
Therefore, m+ is a regular expression matching any string of ms such as each of the
following:

mmm

m

mmmmm

mm

In like manner, 7* is a regular expression matching any string of zero or more 
7s: 

77
77777

777

The string of blanks on the third line matches simply because it has no 7s in it 
at all. 

Brackets, [ ], indicate any one character from the string of characters
specified between the brackets. Thus, [dgka] matches a single d, g, k, or a. Note
that commas are not included within the brackets. Any comma within the
brackets would be taken as a character to be recognized in the input text.
Ranges within a standard alphabetic or numeric order are indicated with a
hyphen, -. The sequence [a-z], for instance, indicates any lowercase letter.
Somewhat more interestingly,

[A-Za-z0-9*&#]

is a regular expression that matches any letter (whether uppercase or
lowercase), any digit, an asterisk, an ampersand, or a sharp character. Given the
input text

$$$$?? ????!!!*$$ $$$$$$&+===r'-# ((

the lexical analyzer with the previous specification in one of its rules will
recognize the *, &, r, and #, perform on each recognition whatever action the
rule specifies (we have not indicated an action here), and print out the rest of
the text as it stands.

The operators become especially powerful in combination. For example, 
the regular expression to recognize an identifier in many programming 
languages is 

[a-zA-Z][0-9a-zA-Z]*

An identifier in these languages is defined to be a letter followed by zero 
or more letters or digits. That is just what the regular expression says. The 
first pair of brackets matches any letter. The second pair, if it were not fol- 
lowed by a *, would match any digit or letter. The two pairs of brackets with 
their enclosed characters would then match any letter followed by a digit or a 
letter. But with the asterisk, *, the example matches any letter followed by 
any number of letters or digits. In particular, it would recognize the following 
as identifiers: 

e 

pay 

distance 
pH 

R2D2 

Note that it would not recognize the following as identifiers:

not_idenTIFER

5times

$hello

because not_idenTIFER has an embedded underscore; 5times starts with a
digit, not a letter; and $hello starts with a special character. Of course, you
may want to write the specifications for these three examples as an exercise.

A potential problem with operator characters is how we can refer to them,
as characters to look for, in our search pattern. The last example, for instance,
will not recognize text with an * in it. lex solves the problem in one of two
ways: a character enclosed in quotation marks or a character preceded by a \
is taken literally, that is, as part of the text to be searched for. To use the
backslash method to recognize, say, an * followed by any number of digits,
we can use the pattern

\*[0-9]*

To recognize a \ itself, we need two backslashes: \\.

Actions

Once lex recognizes a string matching the regular expression at the start
of a rule, it looks to the right of the rule for the action to be performed. Kinds
of actions include recording the token type found and its value, if any,
replacing one token with another, and counting the number of instances of a token
or token type. What you want to do is write these actions as program
fragments in the host language C. An action may consist of as many statements
as are needed for the job at hand. You may want to print out a message
noting that the text has been found or a message transforming the text in some
way. Thus, to recognize the expression Amelia Earhart and to note such
recognition, the rule

"Amelia Earhart"        printf("found Amelia");

would do. And to replace in a text lengthy medical terms with their
equivalent acronyms, a rule such as

Electroencephalogram    printf("EEG");

would be called for. To count the lines in a text, we need to recognize end- 
of-lines and increment a linecounter. lex uses the standard escape sequences 
from C like \n for end-of-line. To count lines we might have 

\n lineno++; 

where lineno, like other C variables, is declared in the definitions section that 
we will discuss later. 

lex stores every character string that it recognizes in a character array
called yytext[]. You can print or manipulate the contents of this array as you
want. Sometimes your action may consist of two or more C statements and
you must (or for style and clarity, you may choose to) write it on several lines.
To inform lex that the action is for one rule only, simply enclose the C code
in braces. For example, to count the total number of all digit strings in an
input text, print the running total of the number of digit strings (not their
sum), and print out each one as soon as it is found, your lex code might be

\+?[0-9]+   { digstrngcount++;
              printf("%d", digstrngcount);
              printf("%s", yytext); }

This specification matches digit strings whether they are preceded by a plus 
sign or not, because the ? indicates that the preceding plus sign is optional. In 
addition, it will catch negative digit strings, because that portion following the 
minus sign, -, will match the specification. The next section explains how to 
distinguish negative from positive integers. 



Advanced lex Usage 

The lex command provides a suite of features that lets you process input 
text riddled with quite complicated patterns. These include rules that decide 
what specification is relevant, when more than one seems so at first; functions 
that transform one matching pattern into another; and the use of definitions 
and subroutines. Before considering these features, you may want to affirm 
your understanding thus far by examining an example drawing together 
several of the points already covered. 

-[0-9]+        printf("negative integer");
\+?[0-9]+      printf("positive integer");
-0.[0-9]+      printf("negative fraction, no whole number part");
rail[ ]+road   printf("railroad is one word");
crook          printf("Here's a crook");
function       subprogcount++;
G[a-zA-Z]*     { printf("may have a G word here: %s", yytext);
                 Gstringcount++; }

The first three rules recognize negative integers, positive integers, and
negative fractions between 0 and -1. Use of the terminating + in each
specification ensures that one or more digits compose the number in question. Each
of the next three rules recognizes a specific pattern. The specification for
railroad matches cases where one or more blanks intervene between the two
syllables of the word. In the cases of railroad and crook, you may have simply
printed a synonym rather than the messages stated. The rule recognizing a
function increments a counter. The last rule illustrates several points:

■ The braces specify an action sequence extending over several lines.

■ Its action uses the lex array yytext[], which stores the recognized char- 
acter string. 

■ Its specification uses the * to indicate that zero or more letters may fol- 
low the G. 

Some Special Features 

Besides storing the recognized character string in yytext[], lex automati- 
cally counts the number of characters in a match and stores it in the variable 
yyleng. You may use this variable to refer to any specific character just 
placed in the array yytext[]. Remember that C numbers locations in an array 
starting with 0, so to print out the third digit (if there is one) in a just recog- 
nized integer, you might write 

[0-9]+   { if (yyleng > 2)
               printf("%c", yytext[2]); }

lex follows a number of high-level rules to resolve ambiguities that may 
arise from the set of rules that you write. Prima facie, any reserved word, for 
instance, could match two rules. In the lexical analyzer example developed 
later in the section on lex and yacc, the reserved word end could match the 
second rule as well as the seventh, the one for identifiers. 



NOTE

lex follows the rule that where there is a match with two or more rules in a
specification, the first rule is the one whose action will be executed.

By placing the rule for end and the other reserved words before the rule for 
identifiers, we ensure that our reserved words will be duly recognized. 

Another potential problem arises from cases where one pattern you are
searching for is the prefix of another. For instance, the last two rules in the
lexical analyzer example above are designed to recognize > and >=. If the
text has the string >= at one point, you might worry that the lexical analyzer
would stop as soon as it recognized the > character to execute the rule for >
rather than read the next character and execute the rule for >=.

NOTE

lex follows the rule that it matches the longest character string possible
and executes the rule for that.

Here it would recognize the >= and act accordingly. As a further example,
the rule would enable you to distinguish + from ++ in a program in C.

Still another potential problem exists when the analyzer must read charac- 
ters beyond the string you are seeking because you cannot be sure you have 
in fact found the string until you have read the additional characters. These 
cases reveal the importance of trailing context. The classic example here is the 
DO statement in FORTRAN. In the statement 

DO 50 k = 1, 20, 1

we cannot be sure that the first 1 is the initial value of the index k until we 
read the first comma. Until then, we might have the assignment statement 

DO50k = 1

(Remember that FORTRAN ignores all blanks.) The way to handle this is to 
use the forward-looking slash, / (not the backslash, \), which signifies that 
what follows is trailing context, something not to be stored in yytext[], 
because it is not part of the token itself. So the rule to recognize the FOR- 
TRAN DO statement could be 

DO/[ ]*[0-9][ ]*[a-z A-Z0-9]+=[a-z A-Z0-9]+,    printf("found DO");

Different versions of FORTRAN have limits on the size of identifiers, here the 
index name. To simplify the example, the rule accepts an index name of any 
length. 

lex uses the $ as an operator to mark a special trailing context: the end of
line. (It is therefore equivalent to /\n.) An example would be a rule to ignore
all blanks and tabs at the end of a line:

[ \t]+$ ;

On the other hand, if you want to match a pattern only when it starts a line,
lex offers you the circumflex, ^, as the operator. The formatter nroff, for
example, demands that you never start a line with a blank, so you might want
to check input to nroff with some such rule as:

^[ ]    printf("error: remove leading blank");

Finally, some of your action statements themselves may require your
reading another character, putting one back to be read again a moment later, or
writing a character on an output device. lex supplies three functions to
handle these tasks: input(), unput(c), and output(c), respectively. One way to
ignore all characters between two special characters, say between a pair of
double quotation marks, would be to use input(), thus:

\"    while (input() != '"');

Upon finding the first double quotation mark, the generated a.out will simply
continue reading all subsequent characters, so long as none is a quotation
mark, and not again look for a match until it finds a second double quotation
mark.

To handle special I/O needs, such as writing to several files, you may use
standard I/O routines in C to rewrite the functions input(), unput(c), and
output(c). These and other programmer-defined functions should be placed in your
subroutine section. Your new routines will then replace the standard ones.
The standard input(), in fact, is equivalent to getchar(), and the standard
output(c) is equivalent to putchar(c).

There are a number of lex routines that let you handle sequences of
characters to be processed in more than one way. These include yymore(),
yyless(n), and REJECT. Recall that the text matching a given specification is
stored in the array yytext[]. In general, once the action is performed for the
specification, the characters in yytext[] are overwritten with succeeding
characters in the input stream to form the next match. The function yymore(), by
contrast, ensures that the succeeding characters recognized are appended to
those already in yytext[]. This lets you do one thing and then another, when
one string of characters is significant and a longer one, which includes the
first, is significant as well. Consider a character string bound by Bs and
interspersed with one at an arbitrary location.

B...B...B

In a simple code-deciphering situation, you may want to count the
number of characters between the first and second Bs and add it to the
number of characters between the second and third B. (Only the last B is not
to be counted.) The code to do this is

B[^B]*   { if (flag == 0) {
               save = yyleng;
               flag = 1;
               yymore();
           }
           else {
               importantno = save + yyleng;
               flag = 0;
           }
         }

where flag, save, and importantno are declared (and at least flag initialized
to 0) in the definitions section. The flag distinguishes the character sequence
terminating just before the second B from that terminating just before the
third.

The function yyless(n) lets you reset the end point of the string to be con- 
sidered to the nth character in the original yytext[]. Suppose you are again in 
the code-deciphering business, and the gimmick here is to work with only half 
the characters in a sequence ending with a certain one, say upper- or lower- 
case Z. The code you want might be 

[a-yA-Y]+[Zz]   { yyless(yyleng/2);
                  ... process first half of string ... }

Finally, the function REJECT lets you more easily process strings of
characters even when they overlap or contain one another as parts. REJECT does
this by immediately jumping to the next rule and its
specification without changing the contents of yytext[]. If you want to count
the number of occurrences both of the regular expression snapdragon and of
its subexpression dragon in an input text, the following will do:

snapdragon   { countflowers++; REJECT; }
dragon       countmonsters++;

As an example of one pattern overlapping another, the following counts
the number of occurrences of the expressions comedian and diana, even
where the input text has sequences such as comediana...:

comedian     { comiccount++; REJECT; }
diana        princesscount++;

Note that the actions here may be considerably more complicated than
simply incrementing a counter. In all cases, the counters and other necessary
variables are declared in the definitions section commencing the lex
specification.

Definitions 

The lex definitions section may contain any of several classes of items. 
The most critical are external definitions, #include statements, and abbrevia- 
tions. Recall that for legal lex source this section is optional, but in most cases 
some of these items are necessary. External definitions have the form and 
function that they do in C. They declare that variables globally defined else- 
where (perhaps in another source file) will be accessed in your lex-generated 
a.out. Consider a declaration from an example to be developed later: 

extern int tokval;

When you store an integer value in a variable declared in this way, it will 
be accessible in the routine, say a parser, that calls it. If, on the other hand, 
you want to define a local variable for use within the action sequence of one 
rule (as you might for the index variable for a loop), you can declare the vari- 
able at the start of the action itself, right after the left brace, { . 

The purpose of the #include statement is the same as in C: to include
files of importance for your program. Some variable declarations and lex
definitions might be needed in more than one lex source file. It is then
advantageous to place them all in one file, to be included in every file that
needs them. One example occurs in using lex with yacc, which generates
parsers that call a lexical analyzer. In this context, you should include the file
y.tab.h, which may contain #defines for token names. Like the declarations,
#include statements should come between %{ and %}, thus:

%{
#include "y.tab.h"
extern int tokval;
int lineno;
%}

In the definitions section, after the %} that ends your #include's and 
declarations, place your abbreviations for regular expressions to be used in the 
rules section. The abbreviation appears on the left of the line and, separated 
by one or more spaces, its definition or translation appears on the right. 
When you use abbreviations in your rules, enclose them within braces. 

NOTE



The purpose of abbreviations is to avoid needless repetition in writing 
your specifications and to provide clarity in reading them. 



As an example, reconsider the lex source reviewed at the beginning of this 
section on advanced lex usage. The use of definitions simplifies our later 
reference to digits, letters, and blanks. This is especially true if the specifica- 
tions appear several times: 

D       [0-9]
L       [a-zA-Z]
B       [ ]
%%
-{D}+           printf("negative integer");
\+?{D}+         printf("positive integer");
-0.{D}+         printf("negative fraction");
G{L}*           printf("may have a G word here");
rail{B}+road    printf("railroad is one word");
crook           printf("criminal");
\"\./{B}+       printf(".\"");



The last rule, newly added to the example and somewhat more complex
than the others, ensures that a period always precedes a quotation mark at the
end of a sentence. It would change example". to example."

Subroutines 

You may want to use subroutines in lex for much the same reason that
you do so in other programming languages. Action code that is to be used for
several rules can be written once and called when needed. As with
definitions, this can simplify the writing and reading of programs. The function
put_in_tabl(), to be discussed in the next section on lex and yacc, is a good
candidate for a subroutine.

Another reason to place a routine in this section is to highlight some code 
of interest or to simplify the rules section, even if the code is to be used for 
one rule only. As an example, consider the following routine to ignore com- 
ments in a language like C where comments occur between /* and */ : 

"/*"            skipcmnts();
                /* rest of rules */
%%
skipcmnts()
{
        for (;;)
        {
                while (input() != '*')
                        ;
                if (input() != '/')
                        unput(yytext[yyleng-1]);
                else return;
        }
}

There are three points of interest in this example. First, the unput(c)
function (putting back the last character read) is necessary to avoid missing
the final / if the comment ends unusually with a **/. In this case, eventually
having read an *, the analyzer finds that the next character is not the terminal
/ and must read some more. Second, the expression yytext[yyleng-1] picks
out that last character read. Third, this routine assumes that the comments
are not nested. (This is indeed the case with the C language.) If, unlike C,
they are nested in the source text, then after input() reads the first */ ending
the inner group of comments, the a.out will read the rest of the comments as
if they were part of the input to be searched for patterns.
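
The scanning logic can also be tried outside lex. The following is a minimal
C sketch under stated assumptions: it works on a character string rather
than on the lex input stream, the name skip_comment() is hypothetical, and
"not consuming the character" stands in for the unput(c) call. It is meant
only to show why a comment ending in **/ is still closed correctly.

```c
#include <stddef.h>

/* p points just past the opening slash-star of a comment.
   Returns a pointer just past the closing star-slash,
   or NULL if the comment is never closed. */
const char *skip_comment(const char *p)
{
    for (;;) {
        while (*p != '\0' && *p != '*')
            p++;                /* skip ordinary comment text */
        if (*p == '\0')
            return NULL;        /* ran out of input inside the comment */
        p++;                    /* consume the '*' */
        if (*p == '/')
            return p + 1;       /* found the closing star-slash */
        /* Otherwise leave *p unread, so that a second '*' can start
           the closing pair; this plays the role of unput(c). */
    }
}
```

Given the text " a **/ b" (the part after the opening delimiter), the
function returns a pointer to " b". Like the lex routine, it does not handle
nested comments.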

Other examples of subroutines would be programmer-defined versions of 
the I/O routines input(), unput(c), and output(), discussed above. Subrou- 
tines such as these that may be exploited by many different programs would 
probably do best to be stored in their own individual file or library to be 
called as needed. The appropriate #include statements would then be neces- 
sary in the definitions section. 



Using lex with yacc 

If you work on a compiler project or develop a program to check the
validity of an input language, you may want to use the UNIX System program
tool yacc. yacc generates parsers, programs that analyze input to ensure that
it is syntactically correct. (yacc is discussed in detail in Chapter 6 of this
guide.) lex often forms a fruitful union with yacc in the compiler
development context. Whether or not you plan to use lex with yacc, be sure
to read this section because it covers information of interest to all lex
programmers.

The lexical analyzer that lex generates (not the file that stores it) takes the 
name yylex(). This name is convenient because yacc calls its lexical analyzer 
by this very name. To use lex to create the lexical analyzer for the parser of a 
compiler, you want to end each lex action with the statement return token, 
where token is a defined term whose value is an integer. The integer value of 
the token returned indicates to the parser what the lexical analyzer has found. 
The parser, whose file is called y.tab.c by yacc, then resumes control and 
makes another call to the lexical analyzer when it needs another token. 

In a compiler, the different values of the token indicate what, if any,
reserved word of the language has been found or whether an identifier,
constant, arithmetic operator, or relational operator has been found. In the
latter cases, the analyzer must also specify the exact value of the token: what
the identifier is, whether the constant, say, is 9 or 888, whether the arithmetic
operator is + or * (multiply), and whether the relational operator is = or >.
Consider the following portion of lex source for a lexical analyzer for some
programming language perhaps slightly reminiscent of Ada:



begin                   return(BEGIN);
end                     return(END);
while                   return(WHILE);
if                      return(IF);
package                 return(PACKAGE);
reverse                 return(REVERSE);
loop                    return(LOOP);
[a-zA-Z][a-zA-Z0-9]*    { tokval = put_in_tabl();
                          return(IDENTIFIER); }
[0-9]+                  { tokval = put_in_tabl();
                          return(INTEGER); }
\+                      { tokval = PLUS;
                          return(ARITHOP); }
\-                      { tokval = MINUS;
                          return(ARITHOP); }
>                       { tokval = GREATER;
                          return(RELOP); }
>=                      { tokval = GREATEREQL;
                          return(RELOP); }

Despite appearances, the tokens returned, and the values assigned to
tokval, are indeed integers. Good programming style dictates that we use
informative terms such as BEGIN, END, WHILE, and so forth to signify the
integers the parser understands, rather than use the integers themselves. You
establish the association by using #define statements in your parser calling
routine in C. For example,

#define BEGIN   1
#define END     2

#define PLUS    7



If the need arises to change the integer for some token type, you then 
change the #define statement in the parser rather than hunt through the 
entire program, changing every occurrence of the particular integer. In using 
yacc to generate your parser, it is helpful to insert the statement 

#include "y.tab.h"

into the definitions section of your lex source. The file y.tab.h provides 
#define statements that associate token names such as BEGIN, END, and so 
on with the integers of significance to the generated parser. 

To indicate the reserved words in the example, the returned integer values 
suffice. For the other token types, the integer value of the token type is stored 
in the programmer-defined variable tokval. This variable, whose definition 
was an example in the definitions section, is globally defined so that the 
parser as well as the lexical analyzer can access it. yacc provides the variable 
yylval for the same purpose. 

Note that the example shows two ways to assign a value to tokval. First,
a function put_in_tabl() places the name and type of the identifier or
constant in a symbol table so that the compiler can refer to it in this or a later
stage of the compilation process. More to the present point, put_in_tabl()
assigns a type value to tokval so that the parser can use the information
immediately to determine the syntactic correctness of the input text. The
function put_in_tabl() would be a routine that the compiler writer might
place in the subroutines section. Second, in the last few actions of the
example, tokval is assigned a specific integer indicating which arithmetic or
relational operator the analyzer recognized. If the name PLUS, for instance,
is associated with the integer 7 by means of the #define statement above,
then when a + sign is recognized, the action assigns to tokval the value 7,
which indicates the +. The analyzer indicates the general class of operator
by the value it returns to the parser (in the example, the integer signified
by ARITHOP or RELOP).
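
The division of labor between the returned token class and tokval can be
sketched in plain C. What follows is an illustration under assumptions, not
the code lex would generate: the function next_token() and the numeric
values other than PLUS are hypothetical, the input is a character string,
and for integers tokval simply holds the numeric value instead of a symbol
table entry from put_in_tabl().

```c
#include <ctype.h>

/* Token classes returned to the caller (illustrative values). */
#define INTEGER     1
#define ARITHOP     2
#define RELOP       3

/* Specific values stored in tokval (PLUS is 7, as in the text). */
#define PLUS        7
#define MINUS       8
#define GREATER     9
#define GREATEREQL 10

int tokval;     /* set by the analyzer, read by the parser */

/* Scan one token starting at *pp and advance *pp past it.
   Returns the token class, 0 at end of input, or -1 for an
   unrecognized character. */
int next_token(const char **pp)
{
    const char *p = *pp;
    int kind;

    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                            /* skip white space */

    if (*p == '\0') {
        kind = 0;
    } else if (isdigit((unsigned char)*p)) {
        tokval = 0;
        while (isdigit((unsigned char)*p))
            tokval = tokval * 10 + (*p++ - '0');
        kind = INTEGER;
    } else if (*p == '+') {
        p++; tokval = PLUS; kind = ARITHOP;
    } else if (*p == '-') {
        p++; tokval = MINUS; kind = ARITHOP;
    } else if (*p == '>') {
        p++;
        if (*p == '=') { p++; tokval = GREATEREQL; }
        else           { tokval = GREATER; }
        kind = RELOP;
    } else {
        p++;                            /* skip the unknown character */
        kind = -1;
    }
    *pp = p;
    return kind;
}
```

A caller sees only three classes of token; it consults tokval when it needs
to know which operator or which number was found.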

As you review the following few steps, you might recall Figure 5-1 at the
start of the chapter. To produce the lexical analyzer in C, run

lex lex.l

where lex.l is the file containing your lex specification. The name lex.l is
conventionally the favorite, but you may use whatever name you want. The
output file that lex produces is automatically called lex.yy.c; this is the lexical
analyzer program that you created with lex. You then compile and link this
as you would any C program, making sure that you invoke the lex library
with the -ll option:

cc lex.yy.c -ll

The lex library provides a default main() program that calls the lexical
analyzer under the name yylex(), so you need not supply your own main().

If you have the lex specification spread across several files, you can run 
lex with each of them individually, but be sure to rename or move each 
lex.yy.c file (with mv) before you run lex on the next one. Otherwise, each 
will overwrite the previous one. Once you have all the generated .c files, you 
can compile all of them, of course, in one command line. 

With the executable a.out produced, you are ready to analyze any desired 
input text. Suppose that the text is stored under the file name textin (this 
name is arbitrary). The lexical analyzer a.out by default takes input from your 
terminal. To have it take the file textin as input, use redirection, thus: 

a.out < textin 

By default, output will appear on your terminal. You can redirect this as well:

a.out < textin > textout

In running lex with yacc, either may be run first. The command

yacc -d grammar.y

spawns a parser in the file y.tab.c. (The -d option creates the file y.tab.h,
which contains the #define statements that associate the yacc-assigned
integer token values with the user-defined token names.) Run lex on its
specification as before:

lex lex.l

To compile and link the output files produced, run

cc lex.yy.c y.tab.c -ly -ll

Note that the yacc library is loaded (with the -ly option) before the lex library
(with the -ll option) to ensure that the main() program supplied will call the
yacc parser.

There are several options available with the lex command. If you use one 
or more of them, place them between the command name lex and the file 
name argument. If you care to see the C program, lex.yy.c, that lex generates 
on your terminal (the default output device), use the -t option. 

lex -t lex.l 

The -v option prints out for you a small set of statistics describing the
so-called finite automata that lex produces with the C program lex.yy.c. (For a
detailed account of finite automata and their importance for lex, see the Aho,
Sethi, and Ullman text, Compilers: Principles, Techniques, and Tools,
Addison-Wesley, 1986.)

lex uses a table (a two-dimensional array in C) to represent its finite auto- 
maton. The maximum number of states that the finite automaton requires is 
set by default to 500. If your lex source has a large number of rules or the 
rules are very complex, this default value may be too small. You can enlarge 
the value by placing the following entry in the definitions section of your lex 
source:

%n 700

This entry tells lex to make the table large enough to handle as many as 
700 states. (The -v option will indicate how large a number you should 
choose.) If you have need to increase the maximum number of state transi- 
tions beyond 2000, the designated parameter is a, thus: 

%a 2800 

Finally, check the Programmer's Reference Manual page on lex for a list of
all the options available with the lex command. In addition, review the paper
by Lesk (the originator of lex) and Schmidt, "Lex - A Lexical Analyzer
Generator," in volume 5 of the UNIX Programmer's Manual, Holt, Rinehart, and
Winston, 1986. It is somewhat dated, but offers several interesting examples.

This tutorial has introduced you to lex programming. As with any
programming language, the way to master it is to write programs and then write
some more.



Introduction



The yacc program provides a general tool for imposing structure on the 
input to a computer program. The yacc user prepares a specification that 
includes the following: 

■ a set of rules to describe the elements of the input 

■ code to be invoked when a rule is recognized 

■ either a definition or declaration of a low-level routine to examine the 
input 

yacc then turns the specification into a C language function that examines 
the input stream. This function, called a parser, works by calling the low- 
level input scanner. The low-level input scanner, called a lexical analyzer, 
picks up items from the input stream. The selected items are known as 
tokens. Tokens are compared to the input construct rules, called grammar 
rules. When one of the rules is recognized, the user code supplied for this
rule (an action) is invoked. Actions are fragments of C language code. They
can return values and make use of values returned by other actions. 

The heart of the yacc specification is the collection of grammar rules. 
Each rule describes a construct and gives it a name. For example, one gram- 
mar rule might be 

date : month_name day ',' year ;

where date, month_name, day, and year represent constructs of interest;
presumably, month_name, day, and year are defined in greater detail
elsewhere. In the example, the comma is enclosed in single quotes. This means
that the comma is to appear literally in the input. The colon and semicolon 
merely serve as punctuation in the rule and have no significance in evaluating 
the input. With proper definitions, the input 

July 4, 1776 

might be matched by the rule. 
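
As a rough illustration of what this one rule accepts, the following C
sketch checks a line of text against the same shape: a name, a number, a
literal comma, and a number. It is a hand-written stand-in, not what yacc
generates. The function parse_date() is a hypothetical name, and the sketch
assumes that any word of up to 15 characters counts as a month_name and
that trailing text is ignored.

```c
#include <stdio.h>

/* Returns 1 if text has the form  month_name day ',' year,
   storing the three parts; returns 0 otherwise.
   month must have room for at least 16 characters. */
int parse_date(const char *text, char *month, int *day, int *year)
{
    /* The comma in the format must appear literally in the input,
       just as the quoted ',' must in the grammar rule. */
    return sscanf(text, "%15s %d , %d", month, day, year) == 3;
}
```

With this sketch, the text July 4, 1776 is accepted with day 4 and year
1776, while July 4 1776 is rejected because the literal comma is missing.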

The lexical analyzer is an important part of the parsing function. This 
user-supplied routine reads the input stream, recognizes the lower-level con- 
structs, and communicates these as tokens to the parser. The lexical analyzer 
recognizes constructs of the input stream as terminal symbols; the parser 
recognizes constructs as nonterminal symbols. To avoid confusion, we will 
refer to terminal symbols as tokens. 

There is considerable leeway in deciding whether to recognize constructs
using the lexical analyzer or grammar rules. For example, the rules

month_name : 'J' 'a' 'n' ;
month_name : 'F' 'e' 'b' ;
        ...
month_name : 'D' 'e' 'c' ;

might be used in the above example. While the lexical analyzer only needs to
recognize individual letters, such low-level rules tend to waste time and space,
and may complicate the specification beyond the ability of yacc to deal with
it. Usually, the lexical analyzer recognizes the month names and returns an
indication that a month_name is seen. In this case, month_name is a token
and the detailed rules are not needed.

Literal characters such as a comma must also be passed through the lexical 
analyzer and are also considered tokens. 

Specification files are very flexible. It is relatively easy to add to the 
above example the rule 

date : month '/' day '/' year ;
allowing 

7/4/1776 
as a synonym for 

July 4, 1776 

on input. In most cases, this new rule could be slipped into a working system 
with minimal effort and little danger of disrupting existing input. 

The input being read may not conform to the specifications. With a
left-to-right scan, input errors are detected as early as is theoretically possible.
Thus, not only is the chance of reading and computing with bad input data 
substantially reduced, but the bad data usually can be found quickly. Error 
handling, provided as part of the input specifications, permits the reentry of 
bad data or the continuation of the input process after skipping over the bad 
data. 

In some cases, yacc fails to produce a parser when given a set of specifica- 
tions. For example, the specifications may be self-contradictory, or they may 
require a more powerful recognition mechanism than that available to yacc. 
The former cases represent design errors; the latter cases often can be 
corrected by making the lexical analyzer more powerful or by rewriting some 
of the grammar rules. While yacc cannot handle all possible specifications, its 
power compares favorably with similar systems. Moreover, the constructs that 
are difficult for yacc to handle are also frequently difficult for human beings to 
handle. Some users have reported that the discipline of formulating valid 
yacc specifications for their input revealed errors of conception or design early 
in the program development. 

The remainder of this chapter describes the following subjects: 

■ the basic process of preparing a yacc specification 

■ the parser operation 

■ how to handle ambiguities 

■ how to handle operator precedences in arithmetic expressions 

■ error detection and recovery 

■ the operating environment and special features of the parsers yacc pro- 
duces 

■ suggestions to improve the style and efficiency of the specifications 

■ advanced topics 

In addition, there are two examples and a summary of the yacc input syn- 
tax. 

Names refer to either tokens or nonterminal symbols. yacc requires token
names to be declared as such. While the lexical analyzer may be included as
part of the specification file, it is perhaps more in keeping with modular
design to keep it as a separate file. Like the lexical analyzer, other subroutines
may be included as well. Thus, every specification file theoretically consists of
three sections: the declarations, (grammar) rules, and subroutines. Sections
are separated by double percent signs, %% (the single percent sign is
generally used in yacc specifications as an escape character).

A full specification file looks like this: 

declarations 
%% 
rules 
%% 

subroutines 

when all sections are used. The declarations and subroutines sections are 
optional. The smallest legal yacc specification is

%%
rules

Blanks, tabs, and newlines are ignored, but they may not appear in names
or multicharacter reserved symbols. Comments may appear wherever a name
is legal. They are enclosed in /* ... */, as in the C language.

The rules section is made up of one or more grammar rules. A grammar 
rule has the form 

A : BODY ; 

where A represents a nonterminal symbol, and BODY represents a sequence 
of zero or more names and literals. The colon and the semicolon are yacc 
punctuation. 

Names may be of any length and may be made up of letters, dots, under- 
scores, and digits, although a digit may not be the first character of a name. 
Uppercase and lowercase letters are distinct. The names used in the body of a 
grammar rule may represent tokens or nonterminal symbols. 

A literal consists of a character enclosed in single quotes. As in the C
language, the backslash, \, is an escape character within literals, and all the C
language escapes are recognized. Thus:

'\n'      newline
'\r'      return
'\''      single quote ( ' )
'\\'      backslash ( \ )
'\t'      tab
'\b'      backspace
'\f'      form feed
'\xxx'    xxx in octal notation



are understood by yacc. For a number of technical reasons, the NULL charac- 
ter (\0 or 0) should never be used in grammar rules. 

If there are several grammar rules with the same left-hand side, the
vertical bar, |, can be used to avoid rewriting the left-hand side. In addition, the
semicolon at the end of a rule is dropped before a vertical bar. Thus the
grammar rules

A : B C D ; 
A : E F ; 
A : G ; 

can be given to yacc as 

A : B C D
  | E F
  | G
  ;

by using the vertical bar. It is not necessary that all grammar rules with the 
same left side appear together in the grammar rules section, although it makes 
the input more readable and easier to change. 

If a nonterminal symbol matches the empty string, this can be indicated
by

epsilon : ;

yacc understands the empty body following the colon to mean that the
nonterminal symbol named epsilon matches the empty string.

Names representing tokens must be declared. This is most simply done
by writing

%token name1 name2 ...

in the declarations section. Every name not defined in the declarations section 
is assumed to represent a nonterminal symbol. Every nonterminal symbol 
must appear on the left side of at least one rule. 

Of all the nonterminal symbols, the start symbol has particular
importance. By default, the start symbol is taken to be the left-hand side of the first
grammar rule in the rules section. It is possible and desirable to explicitly
declare the start symbol in the declarations section using the %start keyword.

%start symbol 

The end of the input to the parser is signaled by a special token, called 
the end-marker. The end-marker is represented by either a zero or a negative 
number. If the tokens up to, but not including, the end-marker form a con- 
struct that matches the start symbol, the parser function returns to its caller 
after the end-marker is seen and accepts the input. If the end-marker is seen 
in any other context, it is an error. 

It is the job of the user-supplied lexical analyzer to return the end-marker 
when appropriate. Usually the end-marker represents some reasonably obvi- 
ous I/O status, such as end of file or end of record. 



Actions 

With each grammar rule, the user may associate actions to be performed 
when the rule is recognized. Actions may return values and may obtain the 
values returned by previous actions. Moreover, the lexical analyzer can return 
values for tokens if desired. 

An action is an arbitrary C language statement and as such can do input 
and output, call subroutines, and alter arrays and variables. An action is 
specified by one or more statements enclosed in curly braces, {, and }. For 
example: 

A : B 

{ 

hello( 1, "abc" ); 

} 

and

XXX : YYY ZZZ
        {
                (void) printf("a message\n");
                flag = 25;
        }

are grammar rules with actions. 

The dollar sign symbol, $, is used to facilitate communication between the 
actions and the parser. The pseudo-variable $$ represents the value returned 
by the complete action. For example, the action 

{ $$ = 1; } 

returns the value of one; in fact, that is all it does.

To obtain the values returned by previous actions and the lexical analyzer,
the action may use the pseudo-variables $1, $2, ..., $n. These refer to the
values returned by components 1 through n of the right side of a rule, with
the components being numbered from left to right. If the rule is

A : B C D ; 

then $2 has the value returned by C, and $3 the value returned by D. The
rule

expr : '(' expr ')' ;

provides a common example. One would expect the value returned by this
rule to be the value of the expr within the parentheses. Since the first
component of the rule is the literal left parenthesis, the desired logical result can
be indicated by

expr : '(' expr ')'
{ 

$$ = $2 ; 

} 

By default the value of a rule is the value of the first element in it ($1).
Thus, grammar rules of the form

A : B ;

frequently need not have an explicit action. In previous examples, all the
actions came at the end of rules. Sometimes, it is desirable to get control
before a rule is fully parsed. yacc permits an action to be written in the
middle of a rule as well as at the end. This action is assumed to return a value
accessible through the usual $ mechanism by the actions to the right of it. In
turn, it may access the values returned by the symbols to its left. Thus, in the
rule below the effect is to set x to 1 and y to the value returned by C.




A : B
        {
                $$ = 1;
        }
    C
        {
                x = $2;
                y = $3;
        }
  ;




Actions that do not terminate a rule are handled by yacc by manufacturing
a new nonterminal symbol name and a new rule matching this name to
the empty string. The interior action is the action triggered by recognizing
this added rule. yacc treats the above example as if it had been written as
follows (where $ACT is an empty action):

$ACT : /* empty */
        {
                $$ = 1;
        }
     ;

A : B $ACT C
        {
                x = $2;
                y = $3;
        }
  ;


In many applications, output is not done directly by the actions. A data
structure, such as a parse tree, is constructed in memory, and transformations 
are applied to it before output is generated. Parse trees are particularly easy 
to construct given routines to build and maintain the tree structure desired. 
For example, suppose there is a C function node written so that the call 

node( L, n1, n2 )

creates a node with label L and descendants n1 and n2 and returns the index
of the newly created node. Then a parse tree can be built by supplying
actions such as 

expr : expr '+' expr
        {

                $$ = node( '+', $1, $3 );

} 

in the specification. 
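
The text leaves node to the user. One possible shape for it is sketched
below in C, purely as an assumption: nodes live in a fixed-size array named
tree, the constant NO_NODE marks a missing descendant, and the returned
index is the position in that array. None of these names come from yacc.

```c
#include <assert.h>

#define MAX_NODES 256
#define NO_NODE   (-1)      /* marks a missing descendant (a leaf) */

struct tree_node {
    int label;              /* operator or operand label */
    int left;               /* index of first descendant, or NO_NODE */
    int right;              /* index of second descendant, or NO_NODE */
};

struct tree_node tree[MAX_NODES];
int tree_size = 0;          /* number of nodes created so far */

/* Create a node with label L and descendants n1 and n2,
   and return the index of the newly created node. */
int node(int L, int n1, int n2)
{
    assert(tree_size < MAX_NODES);
    tree[tree_size].label = L;
    tree[tree_size].left  = n1;
    tree[tree_size].right = n2;
    return tree_size++;
}
```

Because each action returns an index through $$, the value of the start
symbol ends up being the index of the root of the whole tree.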

The user may define other variables to be used by the actions. Declara- 
tions and definitions can appear in the declarations section enclosed in the 
marks %{ and %}. These declarations and definitions have global scope, so 
they are known to the action statements and can be made known to the lexi- 
cal analyzer. For example: 

%{ int variable = 0; %} 

could be placed in the declarations section, making variable accessible to all
of the actions. Users should avoid names beginning with yy because the yacc
parser uses only such names. In the examples shown thus far, all the values
are integers. A discussion of values of other types is found in the section
"Advanced Topics."



Lexical Analysis 

The user must supply a lexical analyzer to read the input stream and com- 
municate tokens (with values, if desired) to the parser. The lexical analyzer is 
an integer-valued function called yylex. The function returns an integer, the 
token number, representing the kind of token read. If there is a value associ- 
ated with that token, it should be assigned to the external variable yylval. 

The parser and the lexical analyzer must agree on these token numbers in 
order for communication between them to take place. The numbers may be 
chosen by yacc or the user. In either case, the #define mechanism of C 
language is used to allow the lexical analyzer to return these numbers symbol- 
ically. For example, suppose that the token name DIGIT has been defined in 
the declarations section of the yacc specification file. The relevant portion of 
the lexical analyzer might look like 

int yylex()
{
        extern int yylval;
        int c;
        ...
        c = getchar();
        ...
        switch (c)
        {
        ...
        case '0':
        case '1':
        ...
        case '9':
                yylval = c - '0';
                return (DIGIT);
        ...
        }
        ...
}

to return the appropriate token.

The intent is to return a token number of DIGIT and a value equal to the 
numerical value of the digit. Provided that the lexical analyzer code is placed 
in the subroutines section of the specification file, the identifier DIGIT is 
defined as the token number associated with the token DIGIT. 

This mechanism leads to clear, easily modified lexical analyzers. The only 
pitfall to avoid is using any token names in the grammar that are reserved or 
significant in C language or the parser. For example, the use of token names 
if or while will almost certainly cause severe difficulties when the lexical 
analyzer is compiled. The token name error is reserved for error handling 
and should not be used naively. 

In the default situation, token numbers are chosen by yacc. The default 
token number for a literal character is the numerical value of the character in 
the local character set. Other names are assigned token numbers starting at 
257. If the yacc command is invoked with the -d option, a file called y.tab.h
is generated. y.tab.h contains #define statements for the tokens.

If the user prefers to assign the token numbers, the first appearance of the
token name or literal in the declarations section must be followed immediately 
by a nonnegative integer. This integer is taken to be the token number of the 
name or literal. Names and literals not defined this way are assigned default 
definitions by yacc. The potential for duplication exists here. Care must be 
taken to make sure that all token numbers are distinct. 

For historical reasons, the end-marker must have token number 0 or
negative. This token number cannot be redefined by the user. Thus, all lexical
analyzers should be prepared to return 0 or a negative number as a token
upon reaching the end of their input.
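
These numbering conventions can be seen together in a small hand-written
analyzer. The sketch below is an assumption-laden illustration rather than
generated code: it reads from a string through the hypothetical helper
set_input(), it defines DIGIT as 257 on the assumption that DIGIT is the
first named token, and it defines yylval itself, whereas in a real build
the definition comes from the parser and DIGIT from y.tab.h.

```c
#define DIGIT 257               /* assumed: first named token number */

int yylval;                     /* value of the token just returned */
const char *input_text = "";    /* remaining input (hypothetical) */

/* Point the analyzer at a new string of input. */
void set_input(const char *s)
{
    input_text = s;
}

int yylex(void)
{
    int c = (unsigned char)*input_text;

    if (c == '\0')
        return 0;               /* end-marker: token number 0 */
    input_text++;

    if (c >= '0' && c <= '9') {
        yylval = c - '0';       /* numerical value of the digit */
        return DIGIT;           /* named token */
    }
    return c;                   /* literal: its character value */
}
```

For the input 7+ this analyzer returns DIGIT with yylval set to 7, then the
character value of +, then 0 to signal the end of input.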

A very useful tool for constructing lexical analyzers is the lex utility.
Lexical analyzers produced by lex are designed to work in close harmony with
yacc parsers. The specifications for these lexical analyzers use regular
expressions instead of grammar rules. lex can be easily used to produce quite
complicated lexical analyzers, but there remain some languages (such as
FORTRAN) that do not fit any theoretical framework and whose lexical
analyzers must be crafted by hand.

The yacc command turns the specification file into a C language
procedure, which parses the input according to the specification given. The
algorithm used to go from the specification to the parser is complex and
will not be discussed here. The parser itself, though, is relatively simple and
understanding its usage will make treatment of error recovery and ambiguities 
easier. 

The parser produced by yacc consists of a finite-state machine with a 
stack. The parser is also capable of reading and remembering the next input 
token (called the look-ahead token). The current state is always the one on 
the top of the stack. The states of the finite-state machine are given small
integer labels. Initially, the machine is in state 0 (the stack contains only state
0) and no look-ahead token has been read.

The machine has only four actions available: shift, reduce, accept, and
error. The parser does a step in the following manner:

1. Based on its current state, the parser decides if it needs a look-ahead
token to choose the action to be taken. If it needs one and does not
have one, it calls yylex to obtain the next token.

2. Using the current state and the look-ahead token if needed, the parser 
decides on its next action and carries it out. This may result in states 
being pushed on the stack or popped off the stack and in the look- 
ahead token being processed or left alone. 

The shift action is the most common action the parser takes. Whenever a 
shift action is taken, there is always a look-ahead token. For example, in 
state 56 there may be an action 

IF shift 34 

which says, in state 56, if the look-ahead token is IF, the current state (56) is 
pushed down on the stack, and state 34 becomes the current state (on the top 
of the stack). The look-ahead token is cleared. 

The reduce action keeps the stack from growing without bounds. The 
reduce actions are appropriate when the parser has seen the right-hand side 
of a grammar rule and is prepared to announce that it has seen an instance of 
the rule replacing the right side by the left side. It may be necessary to con- 
sult the look-ahead token to decide whether or not to reduce (usually it is not 
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necessary). In fact, the default action (represented by a dot) is often a reduce 
action. 

The reduce actions are associated with individual grammar rules. Gram- 
mar rules are also given small integer numbers, and this leads to some confu- 
sion. The action 

reduce 18 

refers to grammar rule 18, while the action 

IF shift 34 

refers to state 34. 

Suppose the rule 

A : x y z ;

is being reduced. The reduce action depends on the left-hand symbol (A in
this case) and the number of symbols on the right-hand side (three in this
case). To reduce, first pop off the top three states from the stack. (In general,
the number of states popped equals the number of symbols on the right side
of the rule.) In effect, these states were the ones put on the stack while
recognizing x, y, and z and no longer serve any useful purpose. After popping
these states, a state is uncovered, which was the state the parser was in before 
beginning to process the rule. Using this uncovered state and the symbol on 
the left side of the rule, perform what is in effect a shift of A. A new state is 
obtained, pushed onto the stack, and parsing continues. There are significant 
differences between the processing of the left-hand symbol and an ordinary 
shift of a token, however, so this action is called a goto action. In particular, 
the look-ahead token is cleared by a shift but is not affected by a goto. In any 
case, the uncovered state contains an entry such as 

A goto 20 

causing state 20 to be pushed onto the stack and become the current state. 

In effect, the reduce action turns back the clock in the parse, popping the 
states off the stack to go back to the state where the right side of the rule was 
first seen. The parser then behaves as if it had seen the left side at that time. 
If the right-hand side of the rule is empty, no states are popped off the
stack. The uncovered state is in fact the current state.
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The reduce action is also important in the treatment of user-supplied 
actions and values. When a rule is reduced, the code supplied with the rule is 
executed before the stack is adjusted. In addition to the stack holding the 
states, another stack running in parallel with it holds the values returned from 
the lexical analyzer and the actions. When a shift takes place, the external 
variable yylval is copied onto the value stack. After the return from the user 
code, the reduction is carried out. When the goto action is done, the external 
variable yyval is copied onto the value stack. The pseudo-variables $1, $2,
etc., refer to the value stack.

The other two parser actions are conceptually much simpler. The accept 
action indicates that the entire input has been seen and that it matches the 
specification. This action appears only when the look-ahead token is the 
end-marker and indicates that the parser has successfully done its job. The 
error action, on the other hand, represents a place where the parser can no 
longer continue parsing according to the specification. The input tokens it has 
seen (together with the look-ahead token) cannot be followed by anything 
that would result in a legal input. The parser reports an error and attempts to 
recover the situation and resume parsing. The error recovery (as opposed to 
the detection of error) will be discussed later. 

Consider the following as a yacc specification:

%token DING DONG DELL
%%
rhyme : sound place
      ;
sound : DING DONG
      ;
place : DELL
      ;

When yacc is invoked with the -v option, a file called y.output is pro- 
duced with a human-readable description of the parser. The y.output file 
corresponding to the above grammar (with some statistics stripped off the 
end) follows. 




state 0

        $accept : _rhyme $end

        DING    shift 3
        .       error

        rhyme   goto 1
        sound   goto 2

state 1

        $accept : rhyme_$end

        $end    accept
        .       error

state 2

        rhyme : sound_place

        DELL    shift 5
        .       error

        place   goto 4

state 3

        sound : DING_DONG

        DONG    shift 6
        .       error

state 4

        rhyme : sound place_    (1)

        .       reduce 1

state 5

        place : DELL_    (3)

        .       reduce 3

state 6

        sound : DING DONG_    (2)

        .       reduce 2

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The actions for each state are specified, and there is a description of the
parsing rules being processed in each state. The _ character is used to indicate
what has been seen and what is yet to come in each rule. The following
input

DING DONG DELL 

can be used to track the operations of the parser. Initially, the current state is 
state 0. The parser needs to refer to the input in order to decide between the 
actions available in state 0, so the first token, DING, is read and becomes the 
look-ahead token. The action in state 0 on DING is shift 3, state 3 is pushed
onto the stack, and the look-ahead token is cleared. State 3 becomes the 
current state. The next token, DONG, is read and becomes the look-ahead 
token. The action in state 3 on the token DONG is shift 6, state 6 is pushed 
onto the stack, and the look-ahead is cleared. The stack now contains 0, 3, 
and 6. In state 6, without even consulting the look-ahead, the parser reduces 
by 

sound : DING DONG

which is rule 2. Two states, 6 and 3, are popped off the stack, uncovering
state 0. Consulting the description of state 0 (looking for a goto on sound),

sound goto 2 

is obtained. State 2 is pushed onto the stack and becomes the current state. 

In state 2, the next token, DELL, must be read. The action is shift 5, so 
state 5 is pushed onto the stack, which now has 0, 2, and 5 on it, and the 
look-ahead token is cleared. In state 5, the only action is to reduce by rule 3. 
This has one symbol on the right-hand side, so one state, 5, is popped off, 
and state 2 is uncovered. The goto in state 2 on place (the left side of rule 3) 
is state 4. Now, the stack contains 0, 2, and 4. In state 4, the only action is to 
reduce by rule 1. There are two symbols on the right, so the top two states 
are popped off, uncovering state 0 again. In state 0, there is a goto on rhyme
causing the parser to enter state 1. In state 1, the input is read and the
end-marker is obtained, indicated by $end in the y.output file. The action in
state 1 (when the end-marker is seen) successfully ends the parse.

The reader is urged to consider how the parser works when confronted 
with such incorrect strings as DING DONG DONG, DING DONG, DING 
DONG DELL DELL, etc. A few minutes spent with this and other simple 
examples is repaid when problems arise in more complicated contexts. 
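
The trace above can be reproduced with a small table-driven sketch in C. The table layout, the token and nonterminal numbering, and the function name parse() are assumptions made for illustration; they are not the tables yacc itself generates. Only the states, shifts, gotos, and rule numbers are taken from the y.output listing.

```c
/* Terminals (END is the end-marker) and nonterminals for the rhyme grammar. */
enum { END = 0, DING, DONG, DELL, NTOKENS };
enum { RHYME, SOUND, PLACE, NNONTERM };

#define NSTATES 7
#define ACCEPT  100     /* special entry: accept the input    */
#define ERROR   0       /* special entry: syntax error        */

/* shift_table[state][token]: ERROR, ACCEPT, or the state to shift to. */
static const int shift_table[NSTATES][NTOKENS] = {
    /*          END     DING   DONG   DELL  */
    /* 0 */ { ERROR,      3,  ERROR, ERROR },
    /* 1 */ { ACCEPT, ERROR,  ERROR, ERROR },
    /* 2 */ { ERROR,  ERROR,  ERROR,     5 },
    /* 3 */ { ERROR,  ERROR,      6, ERROR },
    /* 4 */ { ERROR,  ERROR,  ERROR, ERROR },
    /* 5 */ { ERROR,  ERROR,  ERROR, ERROR },
    /* 6 */ { ERROR,  ERROR,  ERROR, ERROR },
};

/* Rule reduced without consulting the look-ahead, or 0 if there is none. */
static const int default_reduce[NSTATES] = { 0, 0, 0, 0, 1, 3, 2 };

/* goto_table[state][nonterminal]: state entered after a reduction (-1 = unused). */
static const int goto_table[NSTATES][NNONTERM] = {
    {  1,  2, -1 }, { -1, -1, -1 }, { -1, -1,  4 }, { -1, -1, -1 },
    { -1, -1, -1 }, { -1, -1, -1 }, { -1, -1, -1 },
};

static const int rule_len[4] = { 0, 2, 2, 1 };          /* right-side lengths */
static const int rule_lhs[4] = { 0, RHYME, SOUND, PLACE };

/* Parse a token array ending in END. Returns 0 on accept, 1 on error. */
int parse(const int *tokens)
{
    int stack[8];
    int top = 0;
    int lookahead = -1;         /* -1: no look-ahead token has been read */
    int pos = 0;
    int state, rule, act;

    stack[0] = 0;
    for (;;) {
        state = stack[top];
        rule = default_reduce[state];
        if (rule != 0) {
            /* Reduce: pop one state per right-side symbol, then goto. */
            top -= rule_len[rule];
            state = goto_table[stack[top]][rule_lhs[rule]];
            stack[++top] = state;
            continue;
        }
        if (lookahead < 0)
            lookahead = tokens[pos++];
        if (lookahead >= NTOKENS)
            return 1;           /* unknown token number */
        act = shift_table[state][lookahead];
        if (act == ACCEPT)
            return 0;
        if (act == ERROR)
            return 1;
        stack[++top] = act;     /* shift: push the new state */
        lookahead = -1;         /* and clear the look-ahead  */
    }
}
```

Calling parse() on the tokens DING, DONG, DELL, END follows the same sequence of stacks as the walk-through (0 3 6, then 0 2, then 0 2 5, and so on) and returns 0. The incorrect strings suggested above each stop at an error entry and return 1.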
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A set of grammar rules is ambiguous if there is some input string that can 
be structured in two or more different ways. For example, the grammar rule 

expr : expr '-' expr

is a natural way of expressing the fact that one way of forming an arithmetic 
expression is to put two other expressions together with a minus sign between 
them. Unfortunately, this grammar rule does not completely specify the way 
that all complex inputs should be structured. For example, if the input is 

expr - expr - expr

the rule allows this input to be structured as either 

( expr - expr ) - expr

or as 

expr - ( expr - expr )

(The first is called left association, the second right association.) 
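
The two structurings are not merely a matter of notation; for subtraction they give different results. The following C sketch, with hypothetical function names, evaluates a chain of subtractions under each association.

```c
/* Evaluate v[0] - v[1] - ... - v[n-1] with left association:
   ((v[0] - v[1]) - v[2]) - ...   Assumes n >= 1. */
int eval_left(const int *v, int n)
{
    int acc = v[0];
    int i;

    for (i = 1; i < n; i++)
        acc = acc - v[i];
    return acc;
}

/* Evaluate the same chain with right association:
   v[0] - (v[1] - (v[2] - ...))   Assumes n >= 1. */
int eval_right(const int *v, int n)
{
    int acc = v[n - 1];
    int i;

    for (i = n - 2; i >= 0; i--)
        acc = v[i] - acc;
    return acc;
}
```

For the operands 10, 4, and 3, left association computes (10 - 4) - 3 and right association computes 10 - (4 - 3), so the two functions return different values for the same input.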

yacc detects such ambiguities when it is attempting to build the parser. 
Given the input 

expr - expr - expr 

consider the problem that confronts the parser. When the parser has read the 
second expr, the input seen 

expr - expr

matches the right side of the grammar rule above. The parser could reduce 
the input by applying this rule. After applying the rule, the input is reduced 
to expr (the left side of the rule). The parser would then read the final part of 
the input 

- expr

and again reduce. The effect of this is to take the left associative
interpretation.

Alternatively, if the parser sees

expr - expr

it could defer the immediate application of the rule and continue reading the 
input until 
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expr - expr - expr

is seen. It could then apply the rule to the rightmost three symbols reducing 
them to expr, which results in 

expr - expr

being left. Now the rule can be reduced once more. The effect is to take the 
right associative interpretation. Thus, having read 

expr - expr

the parser can do one of two legal things, a shift or a reduction. It has no 
way of deciding between them. This is called a shift-reduce conflict. It may 
also happen that the parser has a choice of two legal reductions. This is 
called a reduce-reduce conflict. Note that there are never any shift-shift con- 
flicts. 

When there are shift-reduce or reduce-reduce conflicts, yacc still pro- 
duces a parser. It does this by selecting one of the valid steps wherever it has 
a choice. A rule describing the choice to make in a given situation is called a 
disambiguating rule. 

yacc invokes two default disambiguating rules: 

1. In a shift-reduce conflict, the default is to do the shift.

2. In a reduce-reduce conflict, the default is to reduce by the earlier 
grammar rule (in the yacc specification). 

Rule 1 implies that reductions are deferred in favor of shifts when there is 
a choice. Rule 2 gives the user rather crude control over the behavior of the 
parser in this situation, but reduce-reduce conflicts should be avoided when 
possible. 

Conflicts may arise because of mistakes in input or logic or because the 
grammar rules (while consistent) require a more complex parser than yacc can 
construct. The use of actions within rules can also cause conflicts if the action 
must be done before the parser can be sure which rule is being recognized. In 
these cases, the application of disambiguating rules is inappropriate and leads 
to an incorrect parser. For this reason, yacc always reports the number of 
shift-reduce and reduce-reduce conflicts resolved by Rule 1 and Rule 2. 
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In general, whenever it is possible to apply disambiguating rules to pro- 
duce a correct parser, it is also possible to rewrite the grammar rules so that 
the same inputs are read but there are no conflicts. For this reason, most pre- 
vious parser generators have considered conflicts to be fatal errors. Our 
experience has suggested that this rewriting is somewhat unnatural and pro- 
duces slower parsers. Thus, yacc will produce parsers even in the presence of 
conflicts. 

As an example of the power of disambiguating rules, consider 

stat : IF '(' cond ')' stat
     | IF '(' cond ')' stat ELSE stat
     ;



which is a fragment from a programming language involving an if-then-else 
statement. In these rules, IF and ELSE are tokens, cond is a nonterminal sym- 
bol describing conditional (logical) expressions, and stat is a nonterminal sym- 
bol describing statements. The first rule will be called the simple if rule and 
the second, the if-else rule. 

These two rules form an ambiguous construction because input of the 
form 

IF ( C1 ) IF ( C2 ) S1 ELSE S2

can be structured according to these rules in two ways

IF ( C1 )
{
        IF ( C2 )
                S1
}
ELSE
        S2

or

IF ( C1 )
{
        IF ( C2 )
                S1
        ELSE
                S2
}

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where the second interpretation is the one given in most programming 
languages having this construct; each ELSE is associated with the last
preceding un-ELSE'd IF. In this example, consider the situation where the parser
has seen 

IF ( C1 ) IF ( C2 ) S1 

and is looking at the ELSE. It can immediately reduce by the simple if rule to 
get 

IF ( C1 ) stat

and then read the remaining input

ELSE S2

and reduce

IF ( C1 ) stat ELSE S2

by the if-else rule. This leads to the first of the above groupings of the input. 

On the other hand, the ELSE may be shifted, S2 read, and then the right- 
hand portion of 

IF ( C1 ) IF ( C2 ) S1 ELSE S2 

can be reduced by the if-else rule to get 

IF ( C1 ) stat

which can be reduced by the simple if rule. This leads to the second of the 
above groupings of the input which is usually desired. 

Once again, the parser can do two valid things — there is a shift-reduce 
conflict. The application of disambiguating rule 1 tells the parser to shift in 
this case, which leads to the desired grouping. 

This shift-reduce conflict arises only when there is a particular current 
input symbol, ELSE, and particular inputs, such as 

IF ( C1 ) IF ( C2 ) S1 

have already been seen. In general, there may be many conflicts, and each 
one will be associated with an input symbol and a set of previously read 
inputs. The previously read inputs are characterized by the state of the 
parser. 
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The conflict messages of yacc are best understood by examining the ver- 
bose (-v) option output file. For example, the output corresponding to the 
above conflict state might be 

23: shift-reduce conflict (shift 45, reduce 18) on ELSE

state 23

        stat : IF ( cond ) stat_    (18)
        stat : IF ( cond ) stat_ELSE stat

        ELSE    shift 45
        .       reduce 18

where the first line describes the conflict — giving the state and the input sym- 
bol. The ordinary state description gives the grammar rules active in the state 
and the parser actions. Recall that the underscore marks the portion of the
grammar rules that has been seen. Thus in the example, in state 23 the
parser has seen input corresponding to 

IF ( cond ) stat

and the two grammar rules shown are active at this time. The parser can do 
two possible things. If the input symbol is ELSE, it is possible to shift into
state 45. State 45 will have, as part of its description, the line 

stat : IF ( cond ) stat ELSE_stat

because the ELSE will have been shifted in this state. In state 23, the
alternative action (described by a dot, .) is to be done if the input symbol is not
mentioned explicitly in the actions. In this case, if the input symbol is not ELSE,
the parser reduces to

stat : IF '(' cond ')' stat

by grammar rule 18. 
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Once again, notice that the numbers following shift commands refer to 
other states, while the numbers following reduce commands refer to grammar 
rule numbers. In the y.output file, the rule numbers are printed in
parentheses after those rules that can be reduced. In most states, there is a
reduce action possible in the state, and this is the default command. The user
who encounters unexpected shift-reduce conflicts will probably want to look
at the verbose output to decide whether the default actions are appropriate.
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There is one common situation where the rules given above for resolving 
conflicts are not sufficient. This is in the parsing of arithmetic expressions. 
Most of the commonly used constructions for arithmetic expressions can be 
naturally described by the notion of precedence levels for operators, together 
with information about left or right associativity. It turns out that ambiguous 
grammars with appropriate disambiguating rules can be used to create parsers 
that are faster and easier to write than parsers constructed from unambiguous 
grammars. The basic notion is to write grammar rules of the form 

expr : expr OP expr

and 

expr : UNARY expr

for all binary and unary operators desired. This creates a very ambiguous
grammar with many parsing conflicts. To avoid ambiguity, the user specifies
the precedence or binding strength of all the operators and the associativity of
the binary operators. This information is sufficient to allow yacc to resolve
the parsing conflicts in accordance with these rules and construct a parser that
realizes the desired precedences and associativities.

The precedences and associativities are attached to tokens in the declara- 
tions section. This is done by a series of lines beginning with a yacc keyword: 
%left, %right, or %nonassoc, followed by a list of tokens. All of the tokens 
on the same line are assumed to have the same precedence level and associa- 
tivity; the lines are listed in order of increasing precedence or binding 
strength. Thus: 

%left '+' '-'
%left '*' '/'

describes the precedence and associativity of the four arithmetic operators. 
Plus and minus are left associative and have lower precedence than star and 
slash, which are also left associative. The keyword %right is used to describe 
right associative operators, and the keyword %nonassoc is used to describe 
operators, like the operator .LT. in FORTRAN, that may not associate with 
themselves. Thus: 

A .LT. B .LT. C
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is illegal in FORTRAN and such an operator would be described with the key- 
word %nonassoc in yacc. As an example of the behavior of these declara- 
tions, the description 



%right '='
%left '+' '-'
%left '*' '/'

%%

expr : expr '=' expr
     | expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | NAME
     ;



might be used to structure the input 

a = b = c*d - e - f*g

as follows

a = ( b = ( ((c*d)-e) - (f*g) ) )

in order to achieve the correct precedence of operators. When this mechanism
is used, unary operators must, in general, be given a precedence. Sometimes
a unary operator and a binary operator have the same symbolic
representation but different precedences. An example is unary and binary
minus, -.

Unary minus may be given the same strength as multiplication, or even 
higher, while binary minus has a lower strength than multiplication. The key- 
word, %prec, changes the precedence level associated with a particular gram- 
mar rule. The keyword %prec appears immediately after the body of the 
grammar rule, before the action or closing semicolon, and is followed by a 
token name or literal. It causes the precedence of the grammar rule to become 
that of the following token name or literal. For example, the rules 
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%left '+' '-'
%left '*' '/'

%%

expr : expr '+' expr
     | expr '-' expr
     | expr '*' expr
     | expr '/' expr
     | '-' expr %prec '*'
     | NAME
     ;








might be used to give unary minus the same precedence as multiplication. 

A token declared by %left, %right, and %nonassoc need not be, but may 
be, declared by %token as well. 

Precedences and associativities are used by yacc to resolve parsing con- 
flicts. They give rise to the following disambiguating rules: 

1. Precedences and associativities are recorded for those tokens and
literals that have them.

2. A precedence and associativity is associated with each grammar rule. 
It is the precedence and associativity of the last token or literal in the 
body of the rule. If the %prec construction is used, it overrides this 
default. Some grammar rules may have no precedence and associa- 
tivity associated with them. 

3. When there is a reduce-reduce conflict or there is a shift-reduce
conflict and either the input symbol or the grammar rule has no
precedence and associativity, then the two default disambiguating rules
given at the beginning of the section are used, and the conflicts are
reported.
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4. If there is a shift-reduce conflict, and both the grammar rule and the 
input character have precedence and associativity associated with 
them, then the conflict is resolved in favor of the action — shift or 
reduce — associated with the higher precedence. If precedences are 
equal, then associativity is used. Left associative implies reduce; right 
associative implies shift; nonassociating implies error. 

Conflicts resolved by precedence are not counted in the number of shift- 
reduce and reduce-reduce conflicts reported by yacc. This means that mis- 
takes in the specification of precedences may disguise errors in the input 
grammar. It is a good idea to be sparing with precedences and use them in a
cookbook fashion until some experience has been gained. The y.output file is 
very useful in deciding whether the parser is actually doing what was 
intended. 
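
Rule 4 above can be summarized in a few lines of C. This is a minimal sketch of the decision only, assuming that precedence levels are represented as integers where a larger number binds more tightly; the type and function names are hypothetical and do not correspond to anything inside yacc.

```c
enum assoc    { LEFT, RIGHT, NONASSOC };
enum decision { DO_SHIFT, DO_REDUCE, DO_ERROR };

/* Resolve a shift-reduce conflict when both the grammar rule and the
   input token have a precedence.  rule_prec belongs to the rule that
   could be reduced; tok_prec and tok_assoc belong to the input token
   that could be shifted. */
enum decision resolve(int rule_prec, int tok_prec, enum assoc tok_assoc)
{
    if (tok_prec > rule_prec)
        return DO_SHIFT;        /* the token binds more tightly */
    if (tok_prec < rule_prec)
        return DO_REDUCE;       /* the rule binds more tightly  */

    /* Equal precedence: associativity decides. */
    switch (tok_assoc) {
    case LEFT:
        return DO_REDUCE;
    case RIGHT:
        return DO_SHIFT;
    default:
        return DO_ERROR;        /* nonassociating operators */
    }
}
```

With '+' and '-' at level 1 and '*' and '/' at level 2, having seen expr - expr the parser reduces when the next token is '-' (equal levels, left associative) and shifts when it is '*' (higher level).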
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Error handling is an extremely difficult area, and many of the problems 
are semantic ones. When an error is found, for example, it may be necessary 
to reclaim parse tree storage, delete or alter symbol table entries, and/or, typi- 
cally, set switches to avoid generating any further output. 

It is seldom acceptable to stop all processing when an error is found. It is 
more useful to continue scanning the input to find further syntax errors. This 
leads to the problem of getting the parser restarted after an error. A general 
class of algorithms to do this involves discarding a number of tokens from the 
input string and attempting to adjust the parser so that input can continue. 

To allow the user some control over this process, yacc provides the token 
name error. This name can be used in grammar rules. In effect, it suggests 
places where errors are expected and recovery might take place. The parser 
pops its stack until it enters a state where the token error is legal. It then 
behaves as if the token error were the current look-ahead token and performs 
the action encountered. The look-ahead token is then reset to the token that 
caused the error. If no special error rules have been specified, the processing 
halts when an error is detected. 

In order to prevent a cascade of error messages, the parser, after detecting 
an error, remains in error state until three tokens have been successfully read 
and shifted. If an error is detected when the parser is already in error state, 
no message is given, and the input token is quietly deleted. 
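
The three-token rule can be pictured as a small counter. The sketch below is an illustration only: the structure and function names are hypothetical, and it assumes that a new error detected while already in error state restarts the count of three, which the text above does not state.

```c
/* Hypothetical bookkeeping for the error state described above. */
struct recovery {
    int errflag;        /* tokens still to be shifted before leaving error state */
    int messages;       /* number of error messages actually reported            */
};

/* Called when a syntax error is detected. */
void on_error(struct recovery *r)
{
    if (r->errflag == 0)
        r->messages++;  /* report only when not already in error state */
    r->errflag = 3;     /* assumption: the count restarts on every error */
}

/* Called each time a token is successfully read and shifted. */
void on_shift(struct recovery *r)
{
    if (r->errflag > 0)
        r->errflag--;
}
```

An error that arrives after only two successful shifts is therefore silent, while one that arrives after three or more is reported as a fresh error.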

As an example, a rule of the form 

stat : error

means that on a syntax error the parser attempts to skip over the statement in 
which the error is seen. More precisely, the parser scans ahead, looking for 
three tokens that might legally follow a statement and start processing at the 
first of these. If the beginnings of statements are not sufficiently distinctive, it 
may make a false start in the middle of a statement and end up reporting a 
second error where there is in fact no error. 

Actions may be used with these special error rules. These actions might 
attempt to reinitialize tables, reclaim symbol table space, etc. 
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Error rules such as these mentioned are very general but difficult to con- 
trol. Rules such as 

stat : error ';'

are somewhat easier. Here, when there is an error, the parser attempts to skip 
over the statement but does so by skipping to the next semicolon. All tokens 
after the error and before the next semicolon cannot be shifted and are dis- 
carded. When the semicolon is seen, this rule will be reduced and any 
cleanup action associated with it performed. 

Another form of error rule arises in interactive applications where it may 
be desirable to permit a line to be reentered after an error. The following 
example

input : error '\n'
        {
                (void) printf("Reenter last line: ");
        }
        input
        {
                $$ = $4;
        }
        ;

is one way to do this. There is one potential difficulty with this approach. 
The parser must correctly process three input tokens before it admits that it 
has correctly resynchronized after the error. If the reentered line contains an 
error in the first two tokens, the parser deletes the offending tokens and gives 
no message. This is clearly unacceptable. For this reason, there is a mechan- 
ism that can force the parser to believe that error recovery has been accom- 
plished. The statement 

yyerrok ;

in an action resets the parser to its normal mode. The last example can be 
rewritten as follows: 
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input : error '\n'
        {
                yyerrok;
                (void) printf("Reenter last line: ");
        }
        input
        {
                $$ = $4;
        }
        ;





As previously mentioned, the token seen immediately after the error sym- 
bol is the input token at which the error was discovered. Sometimes, this is 
inappropriate; for example, an error recovery action might take upon itself the 
job of finding the correct place to resume input. In this case, the previous 
look-ahead token must be cleared. The statement 

yyclearin ; 

in an action will have this effect. For example, suppose the action after error 
were to call some sophisticated resynchronization routine (supplied by the
user) that attempted to advance the input to the beginning of the next valid 
statement. After this routine is called, the next token returned by yylex is 
presumably the first token in a legal statement. The old illegal token must be 
discarded and the error state reset. A rule similar to the following example 
could perform this. 
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stat : error
        {
                resynch();
                yyerrok;
                yyclearin;
        }
        ;



These mechanisms are admittedly crude but do allow for a simple, fairly 
effective recovery of the parser from many errors. Moreover, the user can get 
control to deal with the error actions required by other portions of the pro- 
gram. 
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When the user inputs a specification to yacc, the output is a file of C 
language subroutines, called y.tab.c. The function produced by yacc is called 
yyparse(); it is an integer-valued function. When it is called, it in turn repeat- 
edly calls yylex(), the lexical analyzer supplied by the user (see "Lexical 
Analysis"), to obtain input tokens. Eventually, either an error is detected (in
which case yyparse() returns the value 1, and no error recovery is possible) or
the lexical analyzer returns the end-marker token and the parser accepts. In
this case, yyparse() returns the value 0.

The user must provide a certain amount of environment for this parser in 
order to obtain a working program. For example, as with every C language 
program, a routine called main() must be defined that eventually calls 
yyparse(). In addition, a routine called yyerror() is needed to print a message
when a syntax error is detected. 

These two routines must be supplied in one form or another by the user. 
To ease the initial effort of using yacc, a library has been provided with 
default versions of main() and yyerror(). The library is accessed by a -ly
argument to the cc(1) command or to the loader. The source codes

main()
{
        return (yyparse());
}

and 

# include <stdio.h>

yyerror(s)
        char *s;
{
        (void) fprintf(stderr, "%s\n", s);
}

show the triviality of these default programs. The argument to yyerror() is a
string containing an error message, usually the string syntax error. The average
application wants to do better than this. Ordinarily, the program should
keep track of the input line number and print it along with the message when
a syntax error is detected. The external integer variable yychar contains the
look-ahead token number at the time the error was detected. This may be of 
some interest in giving better diagnostics. Since the main() routine is 
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probably supplied by the user (to read arguments, etc.), the yacc library is
useful only in small projects or in the earliest stages of larger ones. 
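
A somewhat more informative yyerror() might look like the sketch below. The variable lineno and the helper format_error() are hypothetical: lineno is assumed to be incremented by the lexical analyzer at each newline. The variable yychar is normally defined by the generated parser; it is defined here only so that the sketch compiles on its own.

```c
#include <stdio.h>

int lineno = 1;         /* hypothetical: maintained by the lexical analyzer    */
int yychar = -1;        /* normally defined in y.tab.c; here for the sketch only */

/* Hypothetical helper: build the diagnostic text in buf. */
int format_error(char *buf, size_t size, const char *s)
{
    return snprintf(buf, size, "line %d: %s (look-ahead token %d)",
                    lineno, s, yychar);
}

/* Replacement for the library yyerror(): adds the line number and the
   look-ahead token number to the message passed in by the parser. */
void yyerror(const char *s)
{
    char buf[128];

    format_error(buf, sizeof buf, s);
    (void) fprintf(stderr, "%s\n", buf);
}
```

Keeping the formatting in a separate function is a design choice made so that the message can be examined or redirected without touching the routine the parser calls.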

The external integer variable yydebug is normally set to 0. If it is set to a 
nonzero value, the parser will output a verbose description of its actions, 
including a discussion of the input symbols read and what the parser actions 
are. It is possible to set this variable by using sdb. 
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This part contains miscellaneous hints on preparing efficient, easily 
changeable, and clear specifications. The individual subsections are more or 
less independent. 



Input Style

It is difficult to provide rules with substantial actions and still have a read- 
able specification file. The following are a few style hints. 

1. Use all uppercase letters for token names and all lowercase letters for
nonterminal names. This is useful in debugging. 

2. Put grammar rules and actions on separate lines. It makes editing 
easier. 

3. Put all rules with the same left side together. Put the left side in only 
once and let all following rules begin with a vertical bar. 

4. Put a semicolon only after the last rule with a given left-hand side 
and put the semicolon on a separate line. This allows new rules to be 
easily added. 

5. Indent rule bodies by one tab stop and action bodies by two tab stops.

6. Put complicated actions into subroutines defined in separate files. 

Example 1 is written following this style, as are the examples in this sec- 
tion (where space permits). The user must decide about these stylistic ques- 
tions. The central problem, however, is to make the rules visible through the 
morass of action code. 



Left Recursion 

The algorithm used by the yacc parser encourages so-called left recursive
grammar rules. Rules of the form 

name : name rest_of_rule ;

match this algorithm. Rules such as
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list : item
     | list ',' item
     ;



and 

seq : item
    | seq item
    ;

frequently arise when writing specifications of sequences and lists. In each of 
these cases, the first rule will be reduced for the first item only, and the 
second rule will be reduced for the second and all succeeding items. 

With right recursive rules, such as 

seq : item
    | item seq
    ;

the parser is a bit bigger, and the items are seen and reduced from right to 
left. More seriously, an internal stack in the parser is in danger of overflowing 
if a very long sequence is read. Thus, the user should use left recursion wher- 
ever reasonable. 

It is worth considering if a sequence with zero elements has any meaning, 
and if so, consider writing the sequence specification as 

seq : /* empty */
    | seq item
    ;

using an empty rule. Once again, the first rule would always be reduced 
exactly once before the first item was read, and then the second rule would be 
reduced once for each item read. Permitting empty sequences often leads to 
increased generality. However, conflicts might arise if yacc is asked to decide 
which empty sequence it has seen when it has not seen enough to know! 
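
The stack behavior of the two non-empty forms of seq can be illustrated with a simple counting model in C. This is not a parser; it only counts how many grammar symbols sit on the stack while n items are read under each form, and the function names are hypothetical.

```c
/* Left recursion (seq : item | seq item): each item is shifted and
   reduced at once, so at most "seq item" is ever on the stack.
   Returns the largest number of symbols stacked for n >= 1 items. */
int max_depth_left(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                /* shift item */
        if (depth > max)
            max = depth;
        depth = 1;              /* reduce "item" or "seq item" to seq */
    }
    return max;
}

/* Right recursion (seq : item | item seq): no reduction is possible
   until the last item has been read, so every item is stacked first. */
int max_depth_right(int n)
{
    int depth = 0, max = 0, i;

    for (i = 0; i < n; i++) {
        depth++;                /* shift item; nothing can be reduced yet */
        if (depth > max)
            max = depth;
    }
    while (depth > 1)
        depth--;                /* each "item seq" reduction pops two, pushes one */
    return max;
}
```

Under this model the left recursive form never holds more than two symbols, while the right recursive form holds as many symbols as there are items, which is why a very long sequence can overflow the parser's stack.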
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Lexical Tie-Ins 

Some lexical decisions depend on context. For example, the lexical 
analyzer might want to delete blanks normally, but not within quoted strings, 
or names might be entered into a symbol table in declarations but not in 
expressions. One way of handling these situations is to create a global flag 
that is examined by the lexical analyzer and set by actions. For example,




%{
        int dflag;
%}

. . . other declarations . . .

%%

prog  : decls stats
      ;

decls : /* empty */
        {
                dflag = 1;
        }
      | decls declaration
      ;

stats : /* empty */
        {
                dflag = 0;
        }
      | stats statement
      ;

. . . other rules . . .




Specifies a program that consists of zero or more declarations followed by zero 
or more statements. The flag dflag is now when reading statements and 1 
when reading declarations, except for the first token in the first statement. 






This token must be seen by the parser before it can tell that the declaration 
section has ended and the statements have begun. In many cases, this single 
token exception does not affect the lexical scan. 

This kind of back-door approach can be elaborated to a noxious degree.
Nevertheless, it represents a way of doing some things that are difficult, if not
impossible, to do otherwise.
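
On the lexical analyzer's side, such a flag might be consulted as in the
following minimal sketch. The helper lookup_name() and its fixed-size table
are hypothetical and stand in for whatever symbol table the application
uses; only dflag comes from the specification above.

```c
#include <string.h>

#define MAX_SYMS 64             /* arbitrary limit chosen for this sketch */

int dflag = 1;                  /* set and cleared by the parser's actions */

static const char *symtab[MAX_SYMS];
static int nsyms = 0;

/* Return the table index of name.  A name not yet in the table is entered
 * only while declarations are being read (dflag != 0); otherwise -1 is
 * returned.  The caller must keep the string alive, since only the pointer
 * is stored. */
int lookup_name(const char *name)
{
    int i;

    for (i = 0; i < nsyms; i++)
        if (strcmp(symtab[i], name) == 0)
            return i;

    if (dflag && nsyms < MAX_SYMS) {
        symtab[nsyms] = name;
        return nsyms++;
    }
    return -1;
}
```

A lexical analyzer built this way would call lookup_name() for every name
it reads and could report a name that yields -1 as undeclared.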



Reserved Words 

Some programming languages permit you to use words like if, which are 
normally reserved as label or variable names, provided that such use does not 
conflict with the legal use of these names in the programming language. This 
is extremely hard to do in the framework of yacc. It is difficult to pass infor- 
mation to the lexical analyzer telling it this instance of if is a keyword and 
that instance is a variable. The user can make a stab at it using the mechan- 
ism described in the last subsection, but it is difficult. 

A number of ways of making this easier are under advisement. Until
then, it is better that the keywords be reserved, i.e., forbidden for use as
variable names. There are powerful stylistic reasons for preferring this.

This part discusses a number of advanced features of yacc.

Simulating error and accept in Actions 

The parsing actions of error and accept can be simulated in an action by
use of macros YYACCEPT and YYERROR. The YYACCEPT macro causes
yyparse() to return the value 0; YYERROR causes the parser to behave as if the
current input symbol had been a syntax error; yyerror() is called, and error
recovery takes place. These mechanisms can be used to simulate parsers with
multiple end-markers or context sensitive syntax checking.

Accessing Values in Enclosing Rules 

An action may refer to values returned by actions to the left of the current 
rule. The mechanism is simply the same as with ordinary actions, a dollar 
sign followed by a digit.

     sent    : adj noun verb adj noun
                     {
                             look at the sentence . . .
                     }
             ;

     adj     : THE
                     {
                             $$ = THE;
                     }
             | YOUNG
                     {
                             $$ = YOUNG;
                     }
             ;

     noun    : DOG
                     {
                             $$ = DOG;
                     }
             | CRONE
                     {
                             if ( $0 == YOUNG )
                             {
                                     (void) printf( "what?\n" );
                             }
                             $$ = CRONE;
                     }
             ;

In this case, the digit may be 0 or negative. In the action following the
word CRONE, a check is made that the preceding token shifted was not 
YOUNG. Obviously, this is only possible when a great deal is known about 
what might precede the symbol noun in the input. There is also a distinctly 
unstructured flavor about this. Nevertheless, at times this mechanism 
prevents a great deal of trouble, especially when a few combinations are to be 
excluded from an otherwise regular structure.

Support for Arbitrary Value Types

By default, the values returned by actions and the lexical analyzer are
integers. yacc can also support values of other types, including structures. In
addition, yacc keeps track of the types and inserts appropriate union member
names so that the resulting parser is strictly type checked. The yacc value
stack is declared to be a union of the various types of values desired. The user
declares the union and associates union member names with each token and
nonterminal symbol having a value. When the value is referenced through a
$$ or $n construction, yacc will automatically insert the appropriate union
name so that no unwanted conversions take place. In addition, type checking
commands such as lint are far more silent.

There are three mechanisms used to provide for this typing. First, there is 
a way of defining the union. This must be done by the user since other sub- 
routines, notably the lexical analyzer, must know about the union member 
names. Second, there is a way of associating a union member name with 
tokens and nonterminals. Finally, there is a mechanism for describing the 
type of those few values where yacc cannot easily determine the type. 

To declare the union, the user includes 

     %union
     {
             body of union . . .
     }

in the declaration section. This declares the yacc value stack and the external 
variables yylval and yyval to have type equal to this union. If yacc was 
invoked with the -d option, the union declaration is copied onto the y.tab.h 
file as YYSTYPE. 
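
As a rough illustration of what the union looks like from C code, the
sketch below declares by hand a type comparable to the one that a %union
declaration with members ival and dval would produce, and shows a lexical
helper storing a value through the proper member. The member names, the
token numbers, and the helper classify() are assumptions made for this
example; in a real parser, YYSTYPE, yylval, and the token numbers come from
y.tab.h.

```c
#include <stdlib.h>

/* Hand-written stand-in for the type generated from
 *     %union { int ival; double dval; }                                  */
typedef union {
    int    ival;                /* index of a register variable */
    double dval;                /* value of a floating point constant */
} YYSTYPE;

YYSTYPE yylval;                 /* value passed from lexical analyzer to parser */

/* Hypothetical token numbers; yacc normally assigns these itself. */
#define TOK_DREG  257
#define TOK_CONST 258

/* Classify one already-isolated piece of input text and store its value
 * in the union member that matches the token type returned. */
int classify(const char *text)
{
    if (text[0] >= 'a' && text[0] <= 'z' && text[1] == '\0') {
        yylval.ival = text[0] - 'a';
        return TOK_DREG;
    }
    yylval.dval = atof(text);
    return TOK_CONST;
}
```

The point of the member names is visible here: the lexical analyzer must
pick the member by hand, while inside actions yacc inserts the member name
for each $$ or $n reference.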

Once YYSTYPE is defined, the union member names must be associated 
with the various terminal and nonterminal names. The construction

     <name>

is used to indicate a union member name. If this follows one of the keywords 
%token, %left, %right, or %nonassoc, the union member name is associated 
with the tokens listed. Thus, saying

     %left <optype> '+' '-'

causes any reference to values returned by these two tokens to be tagged with
the union member name optype. Another keyword, %type, is used to
associate union member names with nonterminals. Thus, one might say

     %type <nodetype> expr stat

to associate the union member nodetype with the nonterminal symbols expr
and stat.

There remain a couple of cases where these mechanisms are insufficient. 
If there is an action within a rule, the value returned by this action has no a 
priori type. Similarly, reference to left context values (such as $0) leaves yacc 
with no easy way of knowing the type. In this case, a type can be imposed 
on the reference by inserting a union member name between < and > 
immediately after the first $. The example 




     rule    : aaa
                     {
                             $<intval>$ = 3;
                     }
               bbb
                     {
                             fun( $<intval>2, $<other>0 );
                     }
             ;




shows this usage. This syntax has little to recommend it, but the situation 
arises rarely. 

A sample specification is given in Example 2. The facilities in this subsec- 
tion are not triggered until they are used. In particular, the use of %type will 
turn on these mechanisms. When they are used, there is a fairly strict level of 
checking. For example, use of $n or $$ to refer to something with no defined 
type is diagnosed. If these facilities are not triggered, the yacc value stack is 
used to hold ints.

yacc Input Syntax

This section has a description of the yacc input syntax as a yacc
specification. Context dependencies, etc. are not considered. Ironically,
although yacc accepts an LALR(1) grammar, the yacc input specification
language is most naturally specified as an LR(2) grammar; the sticky part comes
when an identifier is seen in a rule immediately following an action. If this
identifier is followed by a colon, it is the start of the next rule; otherwise,
it is a continuation of the current rule, which just happens to have an action
embedded in it. As implemented, the lexical analyzer looks ahead after seeing
an identifier and decides whether the next token (skipping blanks, newlines,
and comments, etc.) is a colon. If so, it returns the token C_IDENTIFIER.
Otherwise, it returns IDENTIFIER. Literals (quoted strings) are also returned
as IDENTIFIERs but never as part of C_IDENTIFIERs.
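
The lookahead described above can be sketched in C as follows. This is an
illustration under stated assumptions, not the code yacc itself uses: the
function name and the token numbers are hypothetical, and the input is
taken to be a character string positioned just past the identifier.

```c
#include <ctype.h>

/* Hypothetical token numbers for this sketch. */
enum { IDENTIFIER = 257, C_IDENTIFIER = 258 };

/* p points just past an identifier.  Skip blanks, newlines, and C-style
 * comments, then report whether the next significant character is a
 * colon. */
int classify_identifier(const char *p)
{
    for (;;) {
        while (isspace((unsigned char)*p))
            p++;
        if (p[0] == '/' && p[1] == '*') {
            p += 2;
            while (*p != '\0' && !(p[0] == '*' && p[1] == '/'))
                p++;
            if (*p == '\0')
                return IDENTIFIER;      /* unterminated comment */
            p += 2;                     /* step over the closing delimiter */
            continue;                   /* more blanks or comments may follow */
        }
        break;
    }
    return (*p == ':') ? C_IDENTIFIER : IDENTIFIER;
}
```

Because the decision is made in the lexical analyzer, the grammar itself
needs only one token of lookahead and can be handled by yacc.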



     /* grammar for the input to yacc */

     /* basic entries */
     %token IDENTIFIER       /* includes identifiers and literals */
     %token C_IDENTIFIER     /* identifier (but not literal) followed by a : */
     %token NUMBER           /* [0-9]+ */

     /* reserved words: %type=>TYPE %left=>LEFT, etc. */

     %token LEFT RIGHT NONASSOC TOKEN PREC TYPE START UNION

     %token MARK             /* the %% mark */
     %token LCURL            /* the %{ mark */
     %token RCURL            /* the %} mark */

     /* ASCII character literals stand for themselves */

     %start spec

     %%

     spec    : defs MARK rules tail
             ;

     tail    : MARK
                     {
                             In this action, eat up the rest of the file
                     }
             | /* empty: the second MARK is optional */
             ;

     defs    : /* empty */
             | defs def
             ;

     def     : START IDENTIFIER
             | UNION
                     {
                             Copy union definition to output
                     }
             | LCURL
                     {
                             Copy C code to output file
                     }
               RCURL
             | rword tag nlist
             ;

     rword   : TOKEN
             | LEFT
             | RIGHT
             | NONASSOC
             | TYPE
             ;

     tag     : /* empty: union tag is optional */
             | '<' IDENTIFIER '>'
             ;

     nlist   : nmno
             | nlist nmno
             | nlist ',' nmno
             ;

     nmno    : IDENTIFIER            /* Note: literal illegal with %type */
             | IDENTIFIER NUMBER     /* Note: illegal with %type */
             ;

     /* rule section */

     rules   : C_IDENTIFIER rbody prec
             | rules rule
             ;

     rule    : C_IDENTIFIER rbody prec
             | '|' rbody prec
             ;

     rbody   : /* empty */
             | rbody IDENTIFIER
             | rbody act
             ;

     act     : '{'
                     {
                             Copy action, translate $$, etc.
                     }
               '}'
             ;

     prec    : /* empty */
             | PREC IDENTIFIER
             | PREC IDENTIFIER act
             | prec ';'
             ;

1. A Simple Example

This example gives the complete yacc specification for a small desk
calculator; the calculator has 26 registers labeled a through z and accepts
arithmetic expressions made up of the operators

     +, -, *, /, % (mod operator), & (bitwise and), | (bitwise or),
     and assignments.

If an expression at the top level is an assignment, only the assignment is done;
otherwise, the expression is printed. As in the C language, an integer that
begins with 0 (zero) is assumed to be octal; otherwise, it is assumed to be
decimal.

As an example of a yacc specification, the desk calculator does a reason- 
able job of showing how precedence and ambiguities are used and demon- 
strates simple recovery. The major oversimplifications are that the lexical 
analyzer is much simpler than for most applications, and the output is pro- 
duced immediately line by line. Note the way that decimal and octal integers 
are read in by grammar rules. This job is probably better done by the lexical 
analyzer. 
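
If the conversion is moved into the lexical analyzer, as suggested, it might
look like the following sketch. The function read_number() is hypothetical;
it applies the same rule as the grammar (a leading 0 selects base 8) and,
like the grammar, does not reject the digits 8 and 9 in an octal number.

```c
#include <ctype.h>

/* Convert the run of digits at the start of s to an integer.  A first
 * digit of 0 selects octal; any other first digit selects decimal.
 * Returns -1 if s does not begin with a digit. */
int read_number(const char *s)
{
    int base, value;

    if (!isdigit((unsigned char)*s))
        return -1;

    value = *s - '0';
    base = (value == 0) ? 8 : 10;

    for (s++; isdigit((unsigned char)*s); s++)
        value = base * value + (*s - '0');

    return value;
}
```

With a routine of this kind the lexical analyzer could return a single
number token, and the grammar would no longer need the base variable or the
two rules that assemble a number digit by digit.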




     %{
     # include <stdio.h>
     # include <ctype.h>

     int regs[26];
     int base;
     %}

     %start list

     %token DIGIT LETTER

     %left '|'
     %left '&'
     %left '+' '-'
     %left '*' '/' '%'
     %left UMINUS    /* supplies precedence for unary minus */

     %%              /* beginning of rules section */

     list    : /* empty */
             | list stat '\n'
             | list error '\n'
                     {
                             yyerrok;
                     }
             ;

     stat    : expr
                     {
                             (void) printf( "%d\n", $1 );
                     }
             | LETTER '=' expr
                     {
                             regs[$1] = $3;
                     }
             ;

     expr    : '(' expr ')'
                     {
                             $$ = $2;
                     }
             | expr '+' expr
                     {
                             $$ = $1 + $3;
                     }
             | expr '-' expr
                     {
                             $$ = $1 - $3;
                     }
             | expr '*' expr
                     {
                             $$ = $1 * $3;
                     }
             | expr '/' expr
                     {
                             $$ = $1 / $3;
                     }
             | expr '%' expr
                     {
                             $$ = $1 % $3;
                     }
             | expr '&' expr
                     {
                             $$ = $1 & $3;
                     }
             | expr '|' expr
                     {
                             $$ = $1 | $3;
                     }
             | '-' expr %prec UMINUS
                     {
                             $$ = -$2;
                     }
             | LETTER
                     {
                             $$ = regs[$1];
                     }
             | number
             ;

     number  : DIGIT
                     {
                             $$ = $1; base = ($1 == 0) ? 8 : 10;
                     }
             | number DIGIT
                     {
                             $$ = base * $1 + $2;
                     }
             ;

     %%              /* beginning of subroutines section */

     int yylex()     /* lexical analysis routine */
     {               /* return LETTER for lowercase letter, */
                     /* yylval = 0 through 25 */
                     /* returns DIGIT for digit, yylval = 0 through 9 */
                     /* all other characters are returned immediately */
             int c;

             /* skip blanks */
             while ( (c = getchar()) == ' ' )
                     ;

             /* c is now nonblank */

             if (islower(c))
             {
                     yylval = c - 'a';
                     return (LETTER);
             }
             if (isdigit(c))
             {
                     yylval = c - '0';
                     return (DIGIT);
             }
             return (c);
     }



2. An Advanced Example 

This section gives an example of a grammar using some of the advanced
features. The desk calculator example in Example 1 is modified to provide a
desk calculator that does floating point interval arithmetic. The calculator
understands floating point constants; the arithmetic operations +, -, *, /, and
unary -; and the variables a through z. Moreover, it also understands
intervals written

     (X,Y)

where X is less than or equal to Y. There are 26 interval valued variables A
through Z that may also be used. The usage is similar to that in Example 1;
assignments return no value and print nothing while expressions print the
(floating or interval) value.

This example explores a number of interesting features of yacc and C.
Intervals are represented by a structure consisting of the left and right
endpoint values stored as doubles. This structure is given a type name,
INTERVAL, by using typedef. The yacc value stack can also contain floating point
scalars and integers (used to index into the arrays holding the variable values).
Notice that the entire strategy depends strongly on being able to assign
structures and unions in C language. In fact, many of the actions call functions
that return structures as well.

It is also worth noting the use of YYERROR to handle error conditions:
division by an interval containing 0 and an interval presented in the wrong
order. The error recovery mechanism of yacc is used to throw away the rest
of the offending line.

In addition to the mixing of types on the value stack, this grammar also 
demonstrates an interesting use of syntax to keep track of the type (for exam- 
ple, scalar or interval) of intermediate expressions. Note that scalar can be 
automatically promoted to an interval if the context demands an interval 
value. This causes a large number of conflicts when the grammar is run 
through yacc: 18 shift-reduce and 26 reduce-reduce. The problem can be 
seen by looking at the two input lines. 

     2.5 + (3.5 - 4.)

and 

2.5 + (3.5, 4) 

Notice that the 2.5 is to be used in an interval value expression in the 
second example, but this fact is not known until the comma is read. By this 
time, 2.5 is finished, and the parser cannot go back and change its mind. 
More generally, it might be necessary to look ahead an arbitrary number of 
tokens to decide whether to convert a scalar to an interval. This problem is
evaded by having two rules for each binary interval valued operator: one
when the left operand is a scalar and one when the left operand is an interval.
In the second case, the right operand must be an interval, so the conversion 
will be applied automatically. Despite this evasion, there are still many cases 
where the conversion may be applied or not, leading to the above conflicts. 
They are resolved by listing the rules that yield scalars first in the specification 
file; in this way, the conflict will be resolved in the direction of keeping scalar 
valued expressions scalar valued, until they are forced to become intervals.

This way of handling multiple types is very instructive. If there were
many kinds of expression types instead of just two, the number of rules 
needed would increase dramatically and the conflicts even more dramatically. 
Thus, while this example is instructive, it is better practice in a more normal 
programming language environment to keep the type information as part of 
the value and not as part of the grammar. 

Finally, a word about the lexical analysis. The only unusual feature is the
treatment of floating point constants. The C language library routine atof() is
used to do the actual conversion from a character string to a double-precision
value. If the lexical analyzer detects an error, it responds by returning a token
that is illegal in the grammar, provoking a syntax error in the parser and
thence error recovery.



     %{
     #include <stdio.h>
     #include <ctype.h>

     typedef struct interval
     {
             double lo, hi;
     } INTERVAL;

     INTERVAL vmul(), vdiv();

     double atof();

     double dreg[26];
     INTERVAL vreg[26];
     %}

     %start lines

     %union
     {
             int ival;
             double dval;
             INTERVAL vval;
     }

     %token <ival> DREG VREG /* indices into dreg, vreg arrays */
     %token <dval> CONST     /* floating point constant */
     %type <dval> dexp       /* expression */
     %type <vval> vexp       /* interval expression */

     /* precedence information about the operators */

     %left '+' '-'
     %left '*' '/'
     %left UMINUS            /* precedence for unary minus */

     %%      /* beginning of rules section */

     lines   : /* empty */
             | lines line
             ;

     line    : dexp '\n'
                     {
                             (void) printf("%15.8f\n", $1);
                     }
             | vexp '\n'
                     {
                             (void) printf("(%15.8f, %15.8f)\n", $1.lo, $1.hi);
                     }
             | DREG '=' dexp '\n'
                     {
                             dreg[$1] = $3;
                     }
             | VREG '=' vexp '\n'
                     {
                             vreg[$1] = $3;
                     }
             | error '\n'
                     {
                             yyerrok;
                     }
             ;

     dexp    : CONST
             | DREG
                     {
                             $$ = dreg[$1];
                     }
             | dexp '+' dexp
                     {
                             $$ = $1 + $3;
                     }
             | dexp '-' dexp
                     {
                             $$ = $1 - $3;
                     }
             | dexp '*' dexp
                     {
                             $$ = $1 * $3;
                     }
             | dexp '/' dexp
                     {
                             $$ = $1 / $3;
                     }
             | '-' dexp %prec UMINUS
                     {
                             $$ = -$2;
                     }
             | '(' dexp ')'
                     {
                             $$ = $2;
                     }
             ;

     vexp    : dexp
                     {
                             $$.hi = $$.lo = $1;
                     }
             | '(' dexp ',' dexp ')'
                     {
                             $$.lo = $2;
                             $$.hi = $4;
                             if ( $$.lo > $$.hi )
                             {
                                     (void) printf("interval out of order\n");
                                     YYERROR;
                             }
                     }
             | VREG
                     {
                             $$ = vreg[$1];
                     }
             | vexp '+' vexp
                     {
                             $$.hi = $1.hi + $3.hi;
                             $$.lo = $1.lo + $3.lo;
                     }
             | dexp '+' vexp
                     {
                             $$.hi = $1 + $3.hi;
                             $$.lo = $1 + $3.lo;
                     }
             | vexp '-' vexp
                     {
                             $$.hi = $1.hi - $3.lo;
                             $$.lo = $1.lo - $3.hi;
                     }
             | dexp '-' vexp
                     {
                             $$.hi = $1 - $3.lo;
                             $$.lo = $1 - $3.hi;
                     }
             | vexp '*' vexp
                     {
                             $$ = vmul( $1.lo, $1.hi, $3 );
                     }
             | dexp '*' vexp
                     {
                             $$ = vmul( $1, $1, $3 );
                     }
             | vexp '/' vexp
                     {
                             if ( dcheck( $3 ) ) YYERROR;
                             $$ = vdiv( $1.lo, $1.hi, $3 );
                     }
             | dexp '/' vexp
                     {
                             if ( dcheck( $3 ) ) YYERROR;
                             $$ = vdiv( $1, $1, $3 );
                     }
             | '-' vexp %prec UMINUS
                     {
                             $$.hi = -$2.lo; $$.lo = -$2.hi;
                     }
             | '(' vexp ')'
                     {
                             $$ = $2;
                     }
             ;



     %%      /* beginning of subroutines section */

     # define BSZ 50 /* buffer size for floating point number */

     /* lexical analysis */

     int yylex()
     {
             register int c;

             /* skip over blanks */
             while ((c = getchar()) == ' ')
                     ;

             if (isupper(c))
             {
                     yylval.ival = c - 'A';
                     return (VREG);
             }
             if (islower(c))
             {
                     yylval.ival = c - 'a';
                     return (DREG);
             }

             /* gobble up digits, points, exponents */

             if (isdigit(c) || c == '.')
             {
                     char buf[BSZ+1], *cp = buf;
                     int dot = 0, exp = 0;

                     for (; (cp - buf) < BSZ; ++cp, c = getchar())
                     {
                             *cp = c;
                             if (isdigit(c))
                                     continue;
                             if (c == '.')
                             {
                                     if (dot++ || exp)
                                             return ('.'); /* will cause syntax error */
                                     continue;
                             }
                             if (c == 'e')
                             {
                                     if (exp++)
                                             return ('e'); /* will cause syntax error */
                                     continue;
                             }
                             /* end of number */
                             break;
                     }
                     *cp = '\0';
                     if (cp - buf >= BSZ)
                             (void) printf("constant too long -- truncated\n");
                     else
                             ungetc(c, stdin); /* push back last char read */
                     yylval.dval = atof(buf);
                     return (CONST);
             }
             return (c);
     }

     INTERVAL
     hilo(a, b, c, d)
     double a, b, c, d;
     {
             /* returns the smallest interval containing a, b, c, and d */
             /* used by *, / routine */
             INTERVAL v;

             if (a > b)
             {
                     v.hi = a;
                     v.lo = b;
             }
             else
             {
                     v.hi = b;
                     v.lo = a;
             }
             if (c > d)
             {
                     if (c > v.hi)
                             v.hi = c;
                     if (d < v.lo)
                             v.lo = d;
             }
             else
             {
                     if (d > v.hi)
                             v.hi = d;
                     if (c < v.lo)
                             v.lo = c;
             }
             return (v);
     }

     INTERVAL
     vmul(a, b, v)
     double a, b;
     INTERVAL v;
     {
             return (hilo(a * v.hi, a * v.lo, b * v.hi, b * v.lo));
     }

     dcheck(v)
     INTERVAL v;
     {
             if (v.hi >= 0. && v.lo <= 0.)
             {
                     (void) printf("divisor interval contains 0.\n");
                     return (1);
             }
             return (0);
     }

     INTERVAL
     vdiv(a, b, v)
     double a, b;
     INTERVAL v;
     {
             return (hilo(a / v.hi, a / v.lo, b / v.hi, b / v.lo));
     }

Introduction



Both mandatory and advisory file and record locking are available on
current releases of the UNIX System. This capability is intended to provide a
synchronization mechanism for programs accessing the same stores of data
simultaneously. Such processing is characteristic of many multiuser
applications, and the need for a standard method of dealing with the problem has
been recognized by standards advocates like /usr/group, an organization of
UNIX System users from businesses and campuses across the country.

Advisory file and record locking can be used to coordinate self- 
synchronizing processes. In mandatory locking, the standard I/O subroutines 
and I/O system calls enforce the locking protocol. In this way, at the cost of 
a little efficiency, mandatory locking double-checks the programs against 
accessing the data out of sequence. 

The remainder of this chapter describes how file and record locking capa- 
bilities can be used. Examples are given for the correct use of record locking. 
Misconceptions about the amount of protection that record locking affords are 
dispelled. Record locking should be viewed as a synchronization mechanism, 
not a security mechanism. 

The manual pages for the fcntl(2) system call, the lockf(3C) library func- 
tion, and fcntl(5) data structures and commands are referred to throughout 
this section. You should read them in the Programmer's Reference Manual 
before continuing.

Before discussing how record locking should be used, let us first define a
few terms. 

Record 

This is a contiguous set of bytes in a file. The UNIX Operating Sys- 
tem does not impose any record structure on files. This may be done 
by the programs that use the files. 

Cooperating Processes 

These are processes that work together in some well-defined fashion 
to accomplish the tasks at hand. Processes that share files must 
request permission to access the files before using them. File access 
permissions must be carefully set to restrict non-cooperating processes 
from accessing those files. The term process will be used interchange- 
ably with cooperating process to refer to a task obeying such proto- 
cols. 

Read (Share) Locks 

These are used to gain limited access to sections of files. When a read 
lock is in place on a record, other processes may also read lock that 
record, in whole or in part. No other process, however, may have or 
obtain a write lock on an overlapping section of the file. If a process 
holds a read lock, it may assume that no other process will be writing 
or updating that record at the same time. This access method also 
permits many processes to read the given record. This might be 
necessary when searching a file, without the contention involved if a 
write or exclusive lock were to be used. 

Write (Exclusive) Locks 

These are used to gain complete control over sections of files. When a 
write lock is in place on a record, no other process may read or write 
lock that record, in whole or in part. If a process holds a write lock, it 
may assume that no other process will be reading or writing that 
record at the same time. 

Advisory Locking 

This is a form of record locking that does not interact with the I/O 
subsystem [that is, creat(2), open(2), read(2), and write(2)]. The con- 
trol over records is accomplished by requiring an appropriate record 
lock request before I/O operations. If appropriate requests are always 
made by all processes accessing the file, then the accessibility of the
file will be controlled by the interaction of these requests. Advisory
locking depends on the individual processes to enforce the record 
locking protocol; it does not require an accessibility check at the time 
of each I/O request. 

Mandatory Locking 

This is a form of record locking that does interact with the I/O sub- 
system. Access to locked records is enforced by the creat(2), open(2), 
read(2), and write(2) system calls. If a record is locked, then access of 
that record by any other process is restricted according to the type of
lock on the record. The control over records should still be performed 
explicitly by requesting an appropriate record lock before I/O opera- 
tions, but an additional check is made by the system before each I/O 
operation to ensure the record locking protocol is being honored. 
Mandatory locking offers an extra synchronization check, but at the 
cost of some additional system overhead.

File Protection

There are access permissions for UNIX System files to control who may 
read, write, or execute such a file. These access permissions may only be set 
by the owner of the file or by the super-user. The permissions of the direc- 
tory in which the file resides can also affect the ultimate disposition of a file. 
Note that if the directory permissions allow anyone to write in it, then files 
within the directory may be removed, even if those files do not have read, 
write, or execute permission for that user. Any information that is worth pro- 
tecting is worth protecting properly. If your application warrants the use of 
record locking, make sure that the permissions on your files and directories 
are set properly. A record lock, even a mandatory record lock, will only pro- 
tect the portions of the files that are locked. Other parts of these files might 
be corrupted if proper precautions are not taken. 

Only a known set of programs and/or administrators should be able to
read or write a database. This can be done easily by setting the set-group-ID
bit [see chmod(1) in the User's/System Administrator's Reference Manual] of the
database accessing programs. The files can then be accessed by a known set
of programs that obey the record locking protocol. An example of such file
protection, although record locking is not used, is the mail(1) command. In
that command only the particular user and the mail command can read and
write in the unread mail files.



Opening a File for Record Locking

The first requirement for locking a file or segment of a file is having a 
valid open file descriptor. If read locks are to be done, then the file must be 
opened with at least read accessibility, and the same is true for write locks and 
write accessibility. For our example we will open our file for both read and
write access:

     #include <stdio.h>
     #include <errno.h>
     #include <fcntl.h>

     int fd;         /* file descriptor */
     char *filename;

     main(argc, argv)
     int argc;
     char *argv[];
     {
             extern void exit(), perror();

             /* get database file name from command line and open the
              * file for read and write access.
              */
             if (argc < 2) {
                     (void) fprintf(stderr, "usage: %s filename\n", argv[0]);
                     exit(2);
             }
             filename = argv[1];
             fd = open(filename, O_RDWR);
             if (fd < 0) {
                     perror(filename);
                     exit(2);
             }



The file is now open for us to perform both locking and I/O functions. 
We then proceed with the task of setting a lock.

Setting a File Lock

There are several ways for us to set a lock on a file. In part, these 
methods depend upon how the lock interacts with the rest of the program. 
There are also questions of performance as well as portability. Two methods 
will be given here, one using the fcntl(2) system call, the other using the
/usr/group standards compatible lockf(3C) library function call.

Locking an entire file is just a special case of record locking. For both 
these methods the concept and the effect of the lock are the same. The file is 
locked starting at a byte offset of zero (0) and ending at the maximum file 
size. This point extends beyond any real end of the file so that no lock can be 
placed on this file beyond this point. To do this the value of the size of the 
lock is set to zero. The code using the fcntl(2) system call is as follows:

     #define MAX_TRY 10
     int try;
     struct flock lck;

     try = 0;

     /* set up the record locking structure, the address of which
      * is passed to the fcntl system call.
      */
     lck.l_type = F_WRLCK;   /* setting a write lock */
     lck.l_whence = 0;       /* offset l_start from beginning of file */
     lck.l_start = 0L;
     lck.l_len = 0L;         /* until the end of the file address space */

     /* Attempt locking MAX_TRY times before giving up.
      */
     while (fcntl(fd, F_SETLK, &lck) < 0) {
             if (errno == EAGAIN || errno == EACCES) {
                     /* there might be other error cases in which
                      * you might try again.
                      */
                     if (++try < MAX_TRY) {
                             (void) sleep(2);
                             continue;
                     }
                     (void) fprintf(stderr, "File busy try again later!\n");
                     return;
             }
             perror("fcntl");
             exit(2);
     }

This portion of code tries to lock a file. This task is attempted several
times until one of the following things happens:

■ the file is locked 

■ an error occurs 

■ it gives up trying because MAX_TRY has been exceeded.

To perform the same task using the lockf(3C) function, the code is as
follows:

     #define MAX_TRY 10
     int try;
     try = 0;

     /* make sure the file pointer
      * is at the beginning of the file.
      */
     lseek(fd, 0L, 0);

     /* Attempt locking MAX_TRY times before giving up.
      */
     while (lockf(fd, F_TLOCK, 0L) < 0) {
             if (errno == EAGAIN || errno == EACCES) {
                     /* there might be other error cases in which
                      * you might try again.
                      */
                     if (++try < MAX_TRY) {
                             sleep(2);
                             continue;
                     }
                     (void) fprintf(stderr, "File busy try again later!\n");
                     return;
             }
             perror("lockf");
             exit(2);
     }




It should be noted that the lockf(3C) example appears to be simpler, but
the fcntl(2) example exhibits additional flexibility. Using the fcntl(2) method,
it is possible to set the type and start of the lock request simply by setting a
few structure variables. lockf(3C) merely sets write (exclusive) locks; an
additional system call [lseek(2)] is required to specify the start of the lock.
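
As a minimal sketch of that flexibility, the hypothetical helper below uses
fcntl(2) to place a lock of any type on an arbitrary byte range without
moving the file pointer. The name lock_region() and its argument order are
assumptions made for this example; the structure members and the F_SETLK
command are those described in fcntl(5).

```c
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Set or clear a lock on len bytes of fd starting at byte offset start.
 * type is F_RDLCK (read lock), F_WRLCK (write lock), or F_UNLCK (remove).
 * A len of 0 means "through the end of the file".  The call does not
 * wait; it returns 0 on success and -1 if the request fails. */
int lock_region(int fd, short type, off_t start, off_t len)
{
    struct flock lck;

    memset(&lck, 0, sizeof lck);
    lck.l_type = type;
    lck.l_whence = SEEK_SET;    /* l_start is measured from start of file */
    lck.l_start = start;
    lck.l_len = len;

    return (fcntl(fd, F_SETLK, &lck) < 0) ? -1 : 0;
}
```

A call such as lock_region(fd, F_RDLCK, 0L, 16L) requests a read lock on
the first 16 bytes of the file, something that lockf(3C) cannot express
because it has no read locks.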
Setting and Removing Record Locks

Locking a record is done the same way as locking a file except for the 
differing starting point and length of the lock. We will now try to solve an 
interesting and real problem. There are two records (these records may be in 
the same or different file) that must be updated simultaneously so that other 
processes get a consistent view of this information. (This type of problem 
comes up, for example, when updating the interrecord pointers in a doubly 
linked list.) To do this you must decide the following questions: 

■ What do you want to lock? 

■ For multiple locks, in what order do you want to lock and unlock the 
records? 

■ What do you do if you succeed in getting all the required locks? 

■ What do you do if you fail to get all the locks? 

In managing record locks, you must plan a failure strategy if one cannot 
obtain all the required locks. It is because of contention for these records that 
we have decided to use record locking in the first place. Different programs 
might do the following:

■ wait a certain amount of time and try again 

■ abort the procedure and warn the user 

■ let the process sleep until signaled that the lock has been freed 

■ some combination of the above. 

Let us now look at our example of inserting an entry into a doubly linked 
list. For the example, we will assume that the record after which the new 
record is to be inserted has a read lock on it already. The lock on this record 
must be changed or promoted to a write lock so that the record may be
edited.

Promoting a lock (generally from read lock to write lock) is permitted if no
other process is holding a read lock in the same section of the file. If there are
processes with pending write locks that are sleeping on the same section of
the file, the lock promotion succeeds and the other (sleeping) locks wait.
Promoting (or demoting) a write lock to a read lock carries no restrictions. In
either case, the lock is merely reset with the new lock type. Because the
/usr/group lockf function does not have read locks, lock promotion is not
applicable to that call. An example of record locking with lock promotion
follows:




struct record {
    /* ... data portion of record ... */
    long prev;    /* index to previous record in the list */
    long next;    /* index to next record in the list */
};

/* Lock promotion using fcntl(2)
 * When this routine is entered it is assumed that there are read
 * locks on "here" and "next".
 * If write locks on "here" and "next" are obtained:
 *    Set a write lock on "this".
 *    Return index to "this" record.
 * If any write lock is not obtained:
 *    Restore read locks on "here" and "next".
 *    Remove all other locks.
 *    Return a -1.
 */
long
set3lock (this, here, next)
long this, here, next;
{
    struct flock lck;

    lck.l_type = F_WRLCK;    /* setting a write lock */
    lck.l_whence = 0;        /* offset l_start from beginning of file */
    lck.l_start = here;
    lck.l_len = sizeof(struct record);

    /* promote lock on "here" to write lock */
    if (fcntl(fd, F_SETLKW, &lck) < 0) {
        return (-1);
    }

    /* lock "this" with write lock */
    lck.l_start = this;
    if (fcntl(fd, F_SETLKW, &lck) < 0) {
        /* Lock on "this" failed;
         * demote lock on "here" to read lock.
         */
        lck.l_type = F_RDLCK;
        lck.l_start = here;
        (void) fcntl(fd, F_SETLKW, &lck);
        return (-1);
    }

    /* promote lock on "next" to write lock */
    lck.l_start = next;
    if (fcntl(fd, F_SETLKW, &lck) < 0) {
        /* Lock on "next" failed;
         * demote lock on "here" to read lock,
         */
        lck.l_type = F_RDLCK;
        lck.l_start = here;
        (void) fcntl(fd, F_SETLK, &lck);
        /* and remove lock on "this".
         */
        lck.l_type = F_UNLCK;
        lck.l_start = this;
        (void) fcntl(fd, F_SETLK, &lck);
        return (-1);    /* cannot set lock, try again or quit */
    }

    return (this);
}



The locks on these three records were all set to wait (sleep) if another
process was blocking them from being set. This was done with the F_SETLKW
command. If the F_SETLK command were used instead, the fcntl system calls
would fail if blocked. The program would then have to be changed to handle
the blocked condition in each of the error return sections.

Let us now look at a similar example using the lockf function. Since there 
are no read locks, all (write) locks will be referenced generically as locks. 
/* Lock promotion using lockf(3C)
 * When this routine is entered, it is assumed that there are
 * no locks on "here" and "next".
 * If locks are obtained:
 *    Set a lock on "this".
 *    Return index to "this" record.
 * If any lock is not obtained:
 *    Remove all other locks.
 *    Return a -1.
 */

#include <unistd.h>

long
set3lock (this, here, next)
long this, here, next;
{
    /* lock "here" */
    (void) lseek(fd, here, 0);
    if (lockf(fd, F_LOCK, sizeof(struct record)) < 0) {
        return (-1);
    }

    /* lock "this" */
    (void) lseek(fd, this, 0);
    if (lockf(fd, F_LOCK, sizeof(struct record)) < 0) {
        /* Lock on "this" failed.
         * Clear lock on "here".
         */
        (void) lseek(fd, here, 0);
        (void) lockf(fd, F_ULOCK, sizeof(struct record));
        return (-1);
    }

    /* lock "next" */
    (void) lseek(fd, next, 0);
    if (lockf(fd, F_LOCK, sizeof(struct record)) < 0) {
        /* Lock on "next" failed.
         * Clear lock on "here",
         */
        (void) lseek(fd, here, 0);
        (void) lockf(fd, F_ULOCK, sizeof(struct record));
        /* and remove lock on "this".
         */
        (void) lseek(fd, this, 0);
        (void) lockf(fd, F_ULOCK, sizeof(struct record));
        return (-1);    /* cannot set lock, try again or quit */
    }

    return (this);
}




Locks are removed in the same manner as they are set; only the lock type
is different (F_UNLCK or F_ULOCK). An unlock cannot be blocked by
another process and will only affect locks that were placed by this process.
The unlock only affects the section of the file defined in the previous example
by lck. It is possible to unlock or change the type of lock on a subsection of a
previously set lock. This may cause an additional lock (two locks for one
system call) to be used by the operating system. This occurs if the subsection
is from the middle of the previously set lock.
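
The effect of unlocking the middle of a lock can be observed from a second
process. The sketch below is only an illustration: set_lock and
count_foreign_locks are hypothetical helper names of our own. Because a
process never sees its own locks through F_GETLK, the counting is done by a
child created with fork, which does not inherit the parent's record locks.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* set_lock (hypothetical): apply lock type "type" to len bytes at start. */
int
set_lock(int fd, int type, long start, long len)
{
    struct flock lck;

    lck.l_type = (short) type;
    lck.l_whence = SEEK_SET;
    lck.l_start = start;
    lck.l_len = len;
    return fcntl(fd, F_SETLK, &lck);
}

/*
 * count_foreign_locks (hypothetical): fork a child that walks the file
 * with F_GETLK and reports, through its exit status, how many locked
 * segments held by other processes would block a write lock.
 * Returns the count, or -1 on error.
 */
int
count_foreign_locks(int fd)
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        struct flock lck;
        int n = 0;

        lck.l_whence = SEEK_SET;
        lck.l_start = 0;
        lck.l_len = 0;              /* zero length: test to end of file */
        for (;;) {
            lck.l_type = F_WRLCK;
            if (fcntl(fd, F_GETLK, &lck) < 0)
                _exit(255);
            if (lck.l_type == F_UNLCK)
                break;              /* nothing further is locked */
            n++;
            if (lck.l_len == 0)
                break;              /* lock runs to end of file */
            lck.l_start += lck.l_len;
            lck.l_len = 0;          /* search the rest of the file */
            lck.l_whence = SEEK_SET;
        }
        _exit(n);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return (WEXITSTATUS(status) == 255) ? -1 : WEXITSTATUS(status);
}
```

If the parent write locks bytes 0 through 99 and then unlocks bytes 40
through 59, count_foreign_locks should report one segment before the unlock
and two afterwards.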



Getting Lock Information 

One can determine which processes, if any, are blocking a lock from being
set. This can be used as a simple test or as a means to find locks on a file. A
lock is set up as in the previous examples and the F_GETLK command is used
in the fcntl call. If the lock passed to fcntl would be blocked, the first
blocking lock is returned to the process through the structure passed to fcntl.
That is, the lock data passed to fcntl is overwritten by blocking lock
information. This information includes two pieces of data that have not been
discussed yet, l_pid and l_sysid, that are only used by F_GETLK. (For systems
that do not support a distributed architecture, the value in l_sysid should be
ignored.) These fields uniquely identify the process holding the lock.

If a lock passed to fcntl using the F_GETLK command would not be
blocked by another process' lock, then the l_type field is changed to
F_UNLCK, and the remaining fields in the structure are unaffected. Let us
use this capability to print all the segments locked by other processes. Note
that if there are several read locks over the same segment, only one of these
will be found.



struct flock lck;

/* Find and print "write lock" blocked segments of this file. */
(void) printf("sysid pid type start length\n");
lck.l_whence = 0;
lck.l_start = 0L;
lck.l_len = 0L;
do {
    lck.l_type = F_WRLCK;
    (void) fcntl(fd, F_GETLK, &lck);
    if (lck.l_type != F_UNLCK) {
        (void) printf("%5d %5d %c %8ld %8ld\n",
            lck.l_sysid,
            lck.l_pid,
            (lck.l_type == F_WRLCK) ? 'W' : 'R',
            (long) lck.l_start,
            (long) lck.l_len);
        /* if this lock goes to the end of the address
         * space, no need to look further, so break out.
         */
        if (lck.l_len == 0)
            break;
        /* otherwise, look for new lock after the one
         * just found, testing from there to the end
         * of the address space.
         */
        lck.l_start += lck.l_len;
        lck.l_len = 0L;
    }
} while (lck.l_type != F_UNLCK);
The fcntl function with the F_GETLK command will always return
correctly (that is, it will not sleep or fail) if the values passed to it as
arguments are valid.

The lockf function with the F_TEST command can also be used to test if 
there is a process blocking a lock. This function does not, however, return the 
information about where the lock actually is and which process owns the lock. 
A routine using lockf to test for a lock on a file follows: 



/* find a blocked record. */

/* seek to beginning of file */
(void) lseek(fd, 0L, 0);

/* set the size of the test region to zero (0)
 * to test until the end of the file address space.
 */
if (lockf(fd, F_TEST, 0L) < 0) {
    switch (errno) {
    case EACCES:
    case EAGAIN:
        (void) printf("file is locked by another process\n");
        break;
    case EBADF:
        /* bad argument passed to lockf */
        perror("lockf");
        break;
    default:
        (void) printf("lockf: unknown error <%d>\n", errno);
        break;
    }
}



When a process forks, the child receives a copy of the file descriptors that
the parent has opened. The parent and child also share a common file pointer
for each file. If the parent were to seek to a point in the file, the child's file
pointer would also be at that location. This feature has important implications
when using record locking. The current value of the file pointer is used as the
reference for the offset of the beginning of the lock, as described by l_start,
when using an l_whence value of 1. If both the parent process and child
process set locks on the same file, there is a possibility that a lock will be set
using a file pointer that was reset by the other process. This problem appears
in the lockf(3C) function call as well and is a result of the /usr/group
requirements for record locking. If forking is used in a record locking program,
the child process should close and reopen the file if either locking method is
used. This will result in the creation of a new and separate file pointer that can
be manipulated without this problem occurring. Another solution is to use the
fcntl system call with an l_whence value of 0 or 2. This makes the locking
function atomic, so even processes sharing file pointers can be locked without
difficulty.
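
The shared file pointer is easy to demonstrate. In the sketch below, both
function names are hypothetical and chosen only for this illustration.
offset_after_child_seek lets a child move the file pointer and returns where
the parent finds it afterwards; lock_absolute shows a lock request that does
not depend on the file pointer because it uses an l_whence value of 0
(written here as SEEK_SET).

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * offset_after_child_seek (hypothetical): fork a child that seeks fd
 * to child_pos, wait for it, and return the file offset as the parent
 * then sees it, or -1 on error.
 */
long
offset_after_child_seek(int fd, long child_pos)
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        (void) lseek(fd, (off_t) child_pos, SEEK_SET);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    /* the parent has not moved the pointer itself */
    return (long) lseek(fd, (off_t) 0, SEEK_CUR);
}

/*
 * lock_absolute (hypothetical): write lock len bytes at an offset
 * measured from the beginning of the file, so the result is the same
 * wherever the shared file pointer happens to be.
 */
int
lock_absolute(int fd, long start, long len)
{
    struct flock lck;

    lck.l_type = F_WRLCK;
    lck.l_whence = SEEK_SET;
    lck.l_start = start;
    lck.l_len = len;
    return fcntl(fd, F_SETLK, &lck);
}
```

After the child seeks to offset 100, the parent's call should also report
100, which is why a lock placed relative to the current position could land
on the wrong record.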



Deadlock Handling 

There is a certain level of deadlock detection/avoidance built into the 
record locking facility. This deadlock handling provides the same level of pro- 
tection granted by the /usr/group standard lockf call. This deadlock detection 
is only valid for processes that are locking files or records on a single system. 
Deadlocks can only potentially occur when the system is about to put a record 
locking system call to sleep. A search is made for constraint loops of 
processes that would cause the system call to sleep indefinitely. If such a 
situation is found, the locking system call will fail and set errno to the
deadlock error number. If a process wishes to avoid the use of the system's
deadlock detection, it should set its locks using F_SETLK instead of
F_SETLKW.
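
A program that does sleep on its locks should be prepared for the deadlock
error. The following sketch assumes the deadlock error number is EDEADLK;
the enum and the function name write_lock_wait are hypothetical and exist
only to show one way of separating that case from other failures.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum lock_result { LOCK_OK, LOCK_DEADLOCK, LOCK_ERROR };

/*
 * write_lock_wait (hypothetical): request a write lock and sleep if
 * it is blocked. If the system finds that sleeping would complete a
 * deadlock, fcntl fails with EDEADLK and the caller is told so.
 */
enum lock_result
write_lock_wait(int fd, long start, long len)
{
    struct flock lck;

    lck.l_type = F_WRLCK;
    lck.l_whence = SEEK_SET;
    lck.l_start = start;
    lck.l_len = len;
    if (fcntl(fd, F_SETLKW, &lck) == 0)
        return LOCK_OK;
    return (errno == EDEADLK) ? LOCK_DEADLOCK : LOCK_ERROR;
}
```

On LOCK_DEADLOCK a reasonable response is to release the locks already held
and start the transaction again.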
The use of mandatory locking is not recommended for reasons that will be
made clear in a subsequent section. Whether or not locks are enforced by the
I/O system calls is determined at the time the calls are made and by the state
of the permissions on the file [see chmod(2)]. For locks to be under mandatory
enforcement, the file must be a regular file with the set-group-ID bit on
and the group-execute permission off. If either condition fails, all record locks
are advisory. Mandatory enforcement can be assured by the following code:



#include <sys/types.h>
#include <sys/stat.h>

int mode;
struct stat buf;

    if (stat(filename, &buf) < 0) {
        perror("program");
        exit(2);
    }

    /* get currently set mode */
    mode = buf.st_mode;

    /* remove group execute permission from mode */
    mode &= ~(S_IEXEC>>3);

    /* set 'set group id bit' in mode */
    mode |= S_ISGID;

    if (chmod(filename, mode) < 0) {
        perror("program");
        exit(2);
    }
Files that are to be record locked should never have any type of execute
permission set on them. This is because the operating system does not obey
the record locking protocol when executing a file.

The chmod(1) command can also be easily used to set a file to have
mandatory locking. This can be done with the command,

chmod +l filename

The ls(1) command was also changed to show this setting when you ask for
the long listing format:

ls -l filename

causes the following to be printed:

-rw---l---   1 abc      other    1048576 Dec  3 11:44 filename



Caveat Emptor — Mandatory Locking 

■ Mandatory locking only protects those portions of a file that are 
locked. Other portions of the file that are not locked may be accessed 
according to normal UNIX System file permissions. 

■ If multiple reads or writes are necessary for an atomic transaction, the
process should explicitly lock all such pieces before any I/O begins.
Thus advisory enforcement is sufficient for all programs that perform in
this way.

■ As stated earlier, arbitrary programs should not have unrestricted 
access permission to files that are important enough to record lock. 

■ Advisory locking is more efficient because a record lock check does not 
have to be performed for every I/O request. 
Record Locking and Future Releases of the
UNIX System 

Provisions have been made for file and record locking in a networked UNIX
System environment. In such an environment the system on which the locking
process resides may be remote from the system on which the file and record locks
reside. In this way multiple processes on different systems may put locks 
upon a single file that resides on one of these or yet another system. The 
record locks for a file reside on the system that maintains the file. It is also 
important to note that deadlock detection/avoidance is only determined by 
the record locks being held by and for a single system. Therefore, it is neces- 
sary that a process only hold record locks on a single system at any given 
time for the deadlock mechanism to be effective. If a process needs to main- 
tain locks over several systems, it is suggested that the process avoid the 
sleep-when-blocked features of fcntl or lockf and that the process maintain 
its own deadlock detection. If the process uses the sleep-when-blocked 
feature, then a timeout mechanism should be provided by the process so that 
it does not hang waiting for a lock to be cleared. 
Introduction



Efficient use of disk storage space, memory, and computing power is 
becoming increasingly important. A shared library can offer savings in all 
three areas. For example, if constructed properly, a shared library can make 
a.out files (executable object files) smaller on disk storage and processes (a.out
files that are executing) smaller in memory. 

The first part of this chapter, "Using a Shared Library," is designed to
help you use UNIX System V shared libraries. It describes what a shared
library is and how to use one to build a.out files. It also offers advice about
when and when not to use a shared library and how to determine whether an
a.out uses a shared library.

The second part of this chapter, "Building a Shared Library," describes
how to build a shared library. You do not need to read this part to use shared
libraries. It addresses library developers, advanced programmers who are
expected to build their own shared libraries. Specifically, this part describes
how to use the UNIX System tool mkshlib(1) (documented in the
Programmer's Reference Manual) and how to write C code for shared libraries
on a UNIX System. An example is included. This part also describes how to
use the tool chkshlib(1), which helps you check the compatibility of versions
of shared libraries. Read this part of the chapter only if you have to build a
shared library.



NOTE 



Shared libraries are a new feature of UNIX System V Release 3.0 and 
later. An executable object file that needs shared libraries will not run on 
previous releases of UNIX System V. 
If you are accustomed to using libraries to build your application
programs, shared libraries should blend into your work easily. This part of the
chapter explains what shared libraries are and how and when to use them on
the UNIX System.



What is a Shared Library?

A shared library is a file containing object code that several a.out files 
may use simultaneously while executing. When a program is compiled or link 
edited with a shared library, the library code that defines the program's exter- 
nal references is not copied into the program's object file. Instead, a special 
section called .lib that identifies the library code is created in the object file. 
When the UNIX System executes the resulting a.out file, it uses the informa- 
tion in this section to bring the required shared library code into the address 
space of the process. 

The implementation behind these concepts is a shared library with two 
pieces. The first, called the host shared library, is an archive that the link edi- 
tor searches to resolve user references and to create the .lib section in a.out 
files. The structure and operation of this archive is the same as any archive 
without shared library members. For simplicity, however, in this chapter 
references to archives mean archive libraries without shared library members. 

The second part of a shared library is the target shared library. This is the 
file that the UNIX System uses when running a.out files built with the host 
shared library. It contains the actual code for the routines in the library. 
Naturally, it must be present on the system where the a.out files will be
run.

A shared library offers several benefits by not copying code into a.out 
files. It can 

■ save disk storage space 

Because shared library code is not copied into all the a.out files that use 
the code, these files are smaller and use less disk space. 

■ save memory 

By sharing library code at run time, the dynamic memory needs of 
processes are reduced. 
■ make executable files using library code easier to maintain

As mentioned above, shared library code is brought into a process' 
address space at run time. Updating a shared library effectively 
updates all executable files that use the library, because the operating 
system brings the updated version into new processes. If an error in 
shared library code is fixed, all processes automatically use the 
corrected code. 

Archive libraries cannot, of course, offer this benefit: changes to 
archive libraries do not affect executable files, because code from the 
libraries is copied to the files during link editing, not during execution. 

"Deciding Whether to Use a Shared Library" in this chapter describes shared 
libraries in more detail. 



The UNIX System Shared Libraries 

Shared libraries are part of the SDS core. The networking library included 
with the Networking Support Utilities is also a shared library. Other shared 
libraries may be available now from software vendors and in the future from 
AT&T. 

Shared                Host Library            Target Library
Library               Command Line Option     Pathname

C Library             -lc_s                   /shlib/libc_s
Networking Library    -lnsl_s                 /shlib/libnsl_s

Notice the _s suffix on the library names; we use it to identify both host
and target shared libraries. For example, it distinguishes the standard
relocatable C library libc from the shared C library libc_s. The _s also
indicates that the libraries are statically linked.

The relocatable C library is still available with releases of the C
Programming Language Utilities; this library is searched by default during the
compilation or link editing of C programs. All other archive libraries from
previous releases of the system are also available. Just as you use the archive
libraries' names, you must use a shared library's name when you want to use
it to build your a.out files. You tell the link editor its name with the -l
option, as shown below.



Building an a.out File 

You direct the link editor to search a shared library the same way you 
direct a search of an archive library on the UNIX System: 

cc file ... -o file ... -llibrary_file ...

To direct a search of the networking library, for example, you use the
following command line:

cc file ... -o file ... -lnsl_s ...

And to link all the files in your current directory together with the shared 
C library you'd use the following command line: 

cc *.c -lc_s 

Normally, you should include the -lc_s argument after all other -l
arguments on a command line. The shared C library will then be treated like the
relocatable C library, which is searched by default after all other libraries 
specified on a command line are searched. 

A shared library might be built with references to other shared libraries. 
That is, the first shared library might contain references to symbols that are 
resolved in a second shared library. In this case, both libraries must be given 
on the cc command line, in the order of the dependencies. 

For example, if the library libX.a references symbols in the shared C 
library, the command line would be as follows: 

cc *.c -lX_s -lc_s

Notice that the shared library containing the references to symbols must 
be listed on the command line before the shared library needed to resolve 
those references. For more information on inter-library dependencies, see the 
section "Referencing Symbols in a Shared Library from Another Shared 
Library" later in this chapter. 
Coding an Application

Application source code in C or assembly language is compatible with 
both archive libraries and shared libraries. As a result, you should not have to 
change the code in any applications you already have when you use a shared 
library with them. When coding a new application for use with a shared 
library, you should just observe your standard coding conventions. 

However, do keep the following two points in mind, which apply when 
using either an archive or a shared library: 

■ Don't define symbols in your application with the same names as 
those in a library. 

Although there are exceptions, you should avoid redefining standard 
library routines, such as printf(3S) and strcmp(3C). Replacements that 
are incompatibly defined can cause any library, shared or unshared, to 
behave incorrectly. 

■ Don't use undocumented archive routines. 

Use only the functions and data mentioned on the manual pages
describing the routines in Section 3 of the Programmer's Reference
Manual.



Deciding Whether to Use a Shared Library

You should base your decision to use a shared library on whether it saves 
space in disk storage and memory for your program. A well-designed shared 
library almost always saves space. So, as a general rule, use a shared library 
when it is available. 

To determine what savings are gained from using a shared library, you
might build the same application with both an archive and a shared library,
assuming both kinds are available. Remember that you may do this because
source code is compatible between shared libraries and archive libraries. (See
the above section, "Coding an Application.") Then compare the two versions
of the application for size and performance. For example:
$ cat hello.c
main()
{
    printf("Hello\n");
}
$ cc -o unshared hello.c
$ cc -o shared hello.c -lc_s
$ size unshared shared
unshared: 8680 + 1388 + 2248 = 12316
shared: 300 + 680 + 2248 = 3228



If the application calls only a few library members, it is possible that using 
a shared library could take more disk storage or memory. The following sec- 
tion gives a more detailed discussion about when a shared library does and 
does not save space. 

When making your decision about using shared libraries, also remember 
that they are not available on UNIX System V releases prior to Release 3.0. If 
your program must run on previous releases, you will need to use archive 
libraries. 



More About Saving Space 

This section is designed to help you better understand why your programs 
will usually benefit from using a shared library. It explains 

■ how shared libraries save space that archive libraries cannot 

■ how shared libraries are implemented on the UNIX System 

■ how shared libraries might increase space usage 

How Shared Libraries Save Space 

To better understand how a shared library saves space, we need to com- 
pare it to an archive library. 
A host shared library resembles an archive library in three ways. First, as
noted earlier, both are archive files. Second, the object code in the library
typically defines commonly used text symbols and data symbols. The symbols
defined inside, and made visible outside, the library are external symbols.
Note that the library may also have imported symbols, symbols that it uses
but usually does not define. Third, the link editor searches the library for
these symbols when linking a program to resolve its external references. By
resolving the references, the link editor produces an executable version of the
program, the a.out file.



NOTE 



Note that the link editor on the UNIX System is a static linking tool; 
static linking requires that all symbolic references in a program be 
resolved before the program may be executed. The link editor uses static 
linking with both an archive library and a shared library. 



Although these similarities exist, a shared library differs significantly from 
an archive library. The major differences are related to how the libraries are 
handled to resolve symbolic references, a topic already discussed briefly. 

Consider how the UNIX System handles both types of libraries during 
link editing. To produce an a.out file using an archive library, the link editor 
copies the library code that defines a program's unresolved external reference 
from the library into appropriate .text and .data sections in the program's 
object file. In contrast, to produce an a.out file using a shared library, the link 
editor copies from the shared library into the program's object file only a 
small amount of code for initialization of imported symbols. (See the section 
"Importing Symbols" later in the chapter for more details on imported sym- 
bols.) For the bulk of the library code, it creates a special section called .lib in 
the file that identifies the library code needed at run time and resolves the 
external references to shared library symbols with their correct values. When 
the UNIX System executes the resulting a.out file, it uses the information in 
the .lib section to bring the required shared library code into the address 
space of the process. 

Figure 8-1 depicts the a.out files produced using a regular archive version
and a shared version of the standard C library to compile the following
program:







main()
{
    ...
    printf( "How do you like this manual?\n" );
    ...
    result = strcmp( "I do.", answer );
    ...
}




Notice that the shared version is smaller. Figure 8-2 depicts the process 
images in memory of these two files when they are executed. 

a.out Using                        a.out Using
Archive Library                    Shared Library

FILE HEADER                        FILE HEADER
program .text                      program .text
library .text for printf(3S)       program .data
  and strcmp(3C) [1]               .lib [2]
program .data                      SYMBOL TABLE
library .data for printf(3S)       STRING TABLE
  and strcmp(3C) [1]
SYMBOL TABLE
STRING TABLE

[1] Copied to file by the link editor.
[2] Created by the link editor. Refers to library code for
    printf(3S) and strcmp(3C).

Figure 8-1: a.out Files Created Using an Archive Library and a Shared Library
Now consider what happens when several a.out files need the same code
from a library. When using an archive library, each file gets its own copy of
the code. This results in duplication of the same code on the disk and in
memory when the a.out files are run as processes. In contrast, when a shared
library is used, the library code remains separate from the code in the a.out
files, as indicated in Figure 8-2. This separation enables all processes using
the same shared library to reference a single copy of the code.

[Figure omitted. Labels: address space of the archive version; library code
referred to by .lib, brought into the process' address space from the shared
library; the library code may be brought to other processes simultaneously.]

Figure 8-2: Processes Using an Archive and a Shared Library



How Shared Libraries Are Implemented

Now that you have a better understanding of how shared libraries save
space, you need to consider their implementation on the UNIX System to
understand how they might increase space usage (this seldom happens).
The following paragraphs describe host and target shared libraries, the branch
table, and then how shared libraries might increase space usage.

The Host Library and Target Library

As previously mentioned, every shared library has two parts: the host 
library used for linking that resides on the host machine and the target library 
used for execution that resides on the target machine. The host machine is 
the machine on which you build an a.out file; the target machine is the 
machine on which you run the file. Of course, the host and target may be the 
same machine, but they don't have to be. 
The host library is just like an archive library. Each of its members
(typically a complete object file) defines text and data symbols in its symbol
table. The link editor searches this file when a shared library is used during
the compilation or link editing of a program.

The search is for definitions of symbols referenced in the program but not 
defined there. However, as mentioned earlier, the link editor does not copy 
the library code defining the symbols into the program's object file. Instead, it 
uses the library members to locate the definitions and then places symbols in 
the file that tell where the library code is. The result is the special section in 
the a.out file mentioned earlier (see the section "What is a Shared Library?") 
and shown in Figure 8-1 as .lib. 

The target library used for execution resembles an a.out file. The UNIX 
Operating System reads this file during execution if a process needs a shared 
library. The special .lib section in the a.out file tells which shared libraries
are needed. When the UNIX System executes the a.out file, it uses this sec- 
tion to bring the appropriate library code into the address space of the pro- 
cess. In this way, before the process starts to run, all required library code has 
been made available. 

Shared libraries enable the sharing of .text sections in the target library, 
which is where text symbols are defined. Although processes that use the 
shared library have their own virtual address spaces, they share a single phy- 
sical copy of the library's text among them. That is, the UNIX System uses 
the same physical code for each process that attaches a shared library's text. 

The target library cannot share its .data sections. Each process using data 
from the library has its own private data region (contiguous area of virtual 
address space that mirrors the .data section of the target library). Processes 
that share text do not share data and stack area in order that they do not 
interfere with one another. 

As suggested above, the target library is a lot like an a.out file, which can
also share its text, but not its data. A process must have execute permission
for a target library in order to execute an a.out file that uses the library.

The Branch Table 

When the link editor resolves an external reference in a program, it gets 
the address of the referenced symbol from the host library. This is because a 
static linking loader like ld binds symbols to addresses during link editing. In
this way, the a.out file for the program has an address for each referenced 
symbol. 
What happens if library code is updated and the address of a symbol
changes? Nothing happens to an a.out file built with an archive library,
because that file already has a copy of the code defining the symbol. (Even 
though it isn't the updated copy, the a.out file will still run.) However, the 
change can adversely affect an a.out file built with a shared library. This file 
has only a symbol telling where the required library code is. If the library 
code were updated, the location of that code might change. Therefore, if the 
a.out file ran after the change took place, the operating system could bring in 
the wrong code. To keep the a.out file current, you might have to recompile a 
program that uses a shared library after each library update. 

To prevent the need to recompile, a shared library is implemented with a 
branch table on the UNIX System. A branch table associates text symbols 
with absolute addresses that do not change even when library code is 
changed. Each address labels a jump instruction to the address of the code 
that defines a symbol. Instead of being directly associated with the addresses 
of code, text symbols have addresses in the branch table. 
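
The idea can be sketched in C with an array of function pointers standing in for the branch table. This is only an analogy: the real branch table is a block of jump instructions at fixed addresses, and every name below (greet_v1, SLOT_GREET, and so on) is hypothetical. The point it illustrates is that the caller depends only on a slot that never moves, while the code behind the slot is free to move.

```c
#include <stddef.h>

/* Two hypothetical releases of the same library routine. The code for
   release 2 lives at a different address than the code for release 1. */
static int greet_v1(void) { return 1; }
static int greet_v2(void) { return 2; }

typedef int (*lib_func)(void);

/* Stand-in for the branch table: slot numbers are fixed, but the target
   of each slot may be repointed when the library is rebuilt. */
enum { SLOT_GREET = 0, SLOT_COUNT };
static lib_func branch_table[SLOT_COUNT] = { greet_v1 };

/* A caller that was "linked" knowing only the slot, not the code address. */
int call_greet(void)
{
    return branch_table[SLOT_GREET]();
}

/* Simulates installing an updated library: the slot stays, the code moves. */
void install_release_2(void)
{
    branch_table[SLOT_GREET] = greet_v2;
}
```

After install_release_2() runs, call_greet() reaches the new code without being rebuilt, which is the property the branch table gives an a.out file.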

Figure 8-3 shows two a.out files executing a call to printf(3S). The pro- 
cess on the left was built using an archive library. It already has a copy of the 
library code defining the printf(3S) symbol. The process on the right was 
built using a shared library. This file references an absolute address (10) in 
the branch table of the shared library at run time; at this address, a jump 
instruction references the needed code. 

[Figure 8-3 illustration not reproduced. Its annotation reads: "A shared library uses a branch table."]

Figure 8-3: A Branch Table in a Shared Library



Data symbols do not have a mechanism to prevent a change of address
between shared libraries. The tool chkshlib(1) compares a.out files with a
shared library to check compatibility and help you decide if the files need to
be recompiled. See "Checking Versions of Shared Libraries Using
chkshlib(1)."

How Shared Libraries Might Increase Space Usage

A target library might add space to a process. Recall from "How Shared
Libraries are Implemented" in this chapter that a shared library's target file 
may have both text and data regions connected to a process. While the text 
region is shared by all processes that use the library, the data region is not. 
Every process that uses the library gets its own private copy of the entire 
library data region. Naturally, this region adds to the process's memory 
requirements. As a result, if an application uses only a small part of a shared
library's text and data, executing the application might require more memory 
with a shared library than without one. For example, it would be unwise to 
use the shared C library to access only strcmp(3C). Although sharing 
strcmp(3C) saves disk storage and memory, the memory cost for sharing all 
the shared C library's private data region outweighs the savings. The archive 
version of the library would be more appropriate. 
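
The trade-off can be put into rough arithmetic. The function below is a deliberately simplified model, not a measurement method: it assumes one shared copy of the library text, one private copy of the whole library data region per process, and it ignores paging. All parameter names and any sizes you feed it are hypothetical.

```c
/* Net change in memory, in bytes, when nprocs concurrent processes use a
   shared library instead of each carrying its own archive copy of the
   routines it needs. A positive result means the shared library costs
   more memory under this simplified model. */
long shared_minus_archive(long nprocs,
                          long archive_text, long archive_data,
                          long shared_text, long shared_data)
{
    /* Archive case: every process holds its own text and data. */
    long archive_total = nprocs * (archive_text + archive_data);

    /* Shared case: one copy of the text, but a private copy of the
       entire library data region in every process. */
    long shared_total = shared_text + nprocs * shared_data;

    return shared_total - archive_total;
}
```

With a tiny routine (small archive_text, no data) and a large library data region the result is positive, which matches the strcmp(3C) case described above; with large, widely used routines it turns negative.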

A host library might add space to an a.out file. Recall that UNIX System 
V Release 3.0 uses static linking, which requires that all external references in 
a program be resolved before it is executed. Also recall that a shared library 
may have imported symbols, which are used but not defined by the library. 
To resolve these references, the link editor has to add to the a.out file
initialization code that defines the referenced imported symbols. This code
increases the size of the a.out file.



Identifying a.out Files that Use Shared 
Libraries 

Suppose you have an executable file and you want to know whether it 
uses a shared library. You can use the dump(1) command (documented in the
Programmer's Reference Manual) to look at the section headers for the file:

     dump -hv a.out

If the file has a .lib section, a shared library is needed. If the a.out does 
not have a .lib section, it does not use shared libraries. 

To display the libraries used by a.out, use the -L option as shown in the 
following example: 

dump -L a.out 
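
If you would rather make the same check from a program, the sketch below scans the section headers of an object file image that has already been read into memory. It is not how dump(1) is implemented. It assumes the usual COFF layout (a 20-byte file header with the section count at byte offset 2 and the optional header size at offset 16, followed by 40-byte section headers whose first 8 bytes hold the name) and little-endian byte order; the function name has_lib_section is hypothetical.

```c
#include <stddef.h>
#include <string.h>

#define COFF_FILHSZ 20   /* assumed size of the COFF file header */
#define COFF_SCNHSZ 40   /* assumed size of one COFF section header */

/* Read a 16-bit little-endian value. */
static unsigned rd16(const unsigned char *p)
{
    return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/* Return 1 if the image in buf has a section named ".lib", 0 if it
   does not, and -1 if the image is too short to tell. */
int has_lib_section(const unsigned char *buf, size_t len)
{
    size_t nscns, opthdr, off, i;

    if (len < COFF_FILHSZ)
        return -1;
    nscns  = rd16(buf + 2);    /* number of sections */
    opthdr = rd16(buf + 16);   /* size of the optional header */
    off = COFF_FILHSZ + opthdr;

    for (i = 0; i < nscns; i++, off += COFF_SCNHSZ) {
        if (off + COFF_SCNHSZ > len)
            return -1;
        /* The section name is the first 8 bytes, NUL-padded. */
        if (strncmp((const char *)buf + off, ".lib", 8) == 0)
            return 1;
    }
    return 0;
}
```

A return value of 1 corresponds to the case above in which dump -hv shows a .lib section.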

Debugging a.out Files that Use Shared
Libraries

sdb reads the shared libraries' symbol tables and performs as documented 
(in the Programmer's Reference Manual) using the available debugging informa- 
tion. The branch table is hidden so that functions in shared libraries can be 
referenced by their names, and the M command lists the names of shared 
libraries' target files used by the executable file, among other information. 

Shared library data are not dumped to core files, however. So, if you 
encounter an error that results in a core dump and does not appear to be in 
your application code, you may find debugging easier if you rebuild the appli- 
cation with the archive version of the library used. 



Building a Shared Library

This part of the chapter explains how to build a shared library. It covers
the major steps in the building process, the use of the UNIX System tool 
mkshlib(l) that builds the host and target libraries, and some guidelines for 
writing shared library code. There is an example at the end of this part which 
demonstrates the major features of mkshlib and the steps in the building pro- 
cess. 

This part assumes that you are an advanced C programmer faced with the 
task of building a shared library. It also assumes you are familiar with the 
archive library building process. You do not need to read this part of the 
chapter if you only plan to use the UNIX System shared libraries or other 
shared libraries that have already been built. 



The Building Process

To build a shared library on the UNIX System, you have to complete six 
major tasks: 

■ choosing region addresses 

■ choosing the pathname for the shared library target file 

■ selecting the library contents 

■ rewriting existing library code to be included in the shared library 

■ writing the library specification file 

■ using the mkshlib tool to build the host and target libraries

Each of these tasks is discussed below.

Step 1: Choosing Region Addresses 

The first thing you need to do is choose region addresses for your shared 
library. 

Shared library regions on the 386-based computer correspond to memory
management unit (MMU) segments, each of which is 4 MB. The following
table gives a list of the segment assignments on the 386-based computer (as of 
the copyright date for this guide) and shows what virtual addresses are avail- 
able for libraries you might build. 



Start
Address        Contents                            Target Pathname
----------     --------------------------------    ---------------
0xA0000000     Reserved for AT&T:
   ...           UNIX System Shared C Library      /shlib/libc_s
0xA3C00000       AT&T Networking Library           /shlib/libnsl_s

0xA4000000     Generic Database Library            Unassigned
0xA4400000
0xA4800000
0xA4C00000

0xA5000000     Generic Statistical Library         Unassigned
0xA5400000
0xA5800000
0xA5C00000

0xA6000000     Generic User Interface Library      Unassigned
0xA6400000
0xA6800000
0xA6C00000

0xA7000000     Generic Screen Handling Library     Unassigned
0xA7400000
0xA7800000
0xA7C00000

0xA8000000     Generic Graphics Library            Unassigned
0xA8400000
0xA8800000
0xA8C00000

0xA9000000     Generic Networking Library          Unassigned
0xA9400000
0xA9800000
0xA9C00000

0xAA000000     Generic - to be defined             Unassigned
   ...
0xAFC00000

0xB0000000     For private use                     Unassigned
   ...
0xBFC00000







What does this table tell you? First, the UNIX System shared C library 
and the networking library reside at the pathnames given above and use 
addresses in the reserved range. If you build a shared library that uses 
reserved addresses you run the risk of conflicting with future products. 

Second, a number of segments are allocated for shared libraries that pro- 
vide various services such as graphics, database access, and so on. These 
categories are intended to reduce the chance of address conflicts among com- 
mercially available libraries. Although two libraries of the same type may 
conflict, that doesn't matter. A single process should not usually need to use 
two shared libraries of the same type. If the need arises, a program can use 
one shared library and one archive library. 



NOTE 



Any number of libraries can use the same virtual addresses, even on the 
same machine. Conflicts occur only within a single process, not among 
separate processes. Thus two shared libraries can have the same region 
addresses without causing problems, as long as a single a.out file doesn't 
need to use both libraries. 



Third, several segments are reserved for private use. If you are building a 
large system with many a.out files and processes, shared libraries might 
improve its performance. As long as you don't intend to release the shared 
libraries as separate products, you should use the private region addresses. 
You can put your shared libraries into these segments and avoid conflicting 
with commercial shared libraries. You should also use these segments when 
you will own all the a.out files that access your shared library. Don't risk
address conflicts.
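
When planning addresses it can help to check candidates mechanically. The sketch below assumes the 4 MB segment size and the address ranges listed in the table for this section (0xA0000000 through 0xA3C00000 reserved, 0xA4000000 through 0xAFC00000 for the generic categories, 0xB0000000 through 0xBFC00000 for private use). The function and enumeration names are hypothetical.

```c
#define SEGMENT_SIZE 0x400000UL   /* 4 MB MMU segment */

enum region_class {
    REGION_OUTSIDE,    /* not in any shared library range */
    REGION_RESERVED,   /* reserved for AT&T */
    REGION_GENERIC,    /* allocated to a generic library category */
    REGION_PRIVATE     /* for private use */
};

/* Round an address down to the start of its 4 MB segment. */
unsigned long segment_base(unsigned long addr)
{
    return addr & ~(SEGMENT_SIZE - 1);
}

/* Classify an address by the segment that contains it. */
enum region_class classify_region(unsigned long addr)
{
    unsigned long base = segment_base(addr);

    if (base >= 0xA0000000UL && base <= 0xA3C00000UL)
        return REGION_RESERVED;
    if (base >= 0xA4000000UL && base <= 0xAFC00000UL)
        return REGION_GENERIC;
    if (base >= 0xB0000000UL && base <= 0xBFC00000UL)
        return REGION_PRIVATE;
    return REGION_OUTSIDE;
}

/* Two region addresses collide within one process if they fall in the
   same segment. */
int same_segment(unsigned long a, unsigned long b)
{
    return segment_base(a) == segment_base(b);
}
```

For a library that will not be released as a separate product, a start address for which classify_region() returns REGION_PRIVATE is the kind of choice recommended above.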



NOTE

If you plan to build a commercial shared library, you are strongly
encouraged to provide a compatible, relocatable archive as well. Some of
your customers might not find the shared library appropriate for their
applications. Others might want their applications to run on versions of the
UNIX System without shared library support.



Step 2: Choosing the Target Library Pathname 

After you choose the region addresses for your shared library, you should 
choose the pathname for the target library. We chose /shlib/libc_s for the
shared C library and /shlib/libnsl_s for the networking library. (As men- 
tioned earlier, we use the _s suffix in the pathnames of all statically linked 
shared libraries.) To choose a pathname for your shared library, consult the 
established list of names for your computer or see your system administrator. 
Also keep in mind that shared libraries needed to boot a UNIX System should 
normally be located in /shlib; other application libraries normally reside in 
/usr/lib or in private application directories. Of course, if your shared library 
is for personal use, you can choose any convenient pathname for the target 
library. 

Step 3: Selecting Library Contents 

Selecting the contents for your shared library is the most important task in 
the building process. Some routines are prime candidates for sharing; others 
are not. For example, it's a good idea to include large, frequently used rou- 
tines in a shared library but to exclude smaller routines that aren't used as 
much. What you include will depend on the individual needs of the program- 
mers and other users for whom you are building the library. There are some 
general guidelines you should follow, however. They are discussed in the 
section "Choosing Library Members" in this chapter. Also see the guidelines
in the following sections: "Importing Symbols," "Referencing Symbols in a
Shared Library from Another Shared Library," and "Tuning the Shared
Library Code."

Step 4: Rewriting Existing Library Code

If you choose to include some existing code from an archive library in a
shared library, changing some of the code will make the shared code easier to
maintain. See the section "Changing Existing Code for the Shared Library"
in this chapter.

Step 5: Writing the Library Specification File

After you select and edit all the code for your shared library, you have to 
build the shared library specification file. The library specification file con- 
tains all the information that mkshlib needs to build both the host and target 
libraries. An example specification file is given in the section towards the end
of the chapter, "An Example." Also, see the section "Using the Specification
File for Compatibility" in this chapter. The contents and format of the
specification file are given by the following directives (see also the mkshlib(1)
manual page).

All directives that are followed by multi-line specifications are valid until 
the next directive or the end of file. 

#address sectname address 

Specifies the start address, address, of section sectname for 
the target file. This directive is typically used to specify 
the start addresses of the .text and .data sections. 

#target pathname 

Specifies the pathname, pathname, of the target shared 
library on the target machine. This is the location where 
the operating system looks for the shared library during 
execution. Normally, pathname will be an absolute path- 
name, but it does not have to be. 

This directive must be specified exactly once per specifica- 
tion file. 

#branch

Starts the branch table specifications. The lines following
this directive are taken to be branch table specification
lines.

Branch table specification lines have the following format:

funcname <white space> position

funcname is the name of the symbol given a branch table
entry and position specifies the position of funcname's
branch table entry. position may be a single integer or a
range of integers of the form position1-position2. Each
position must be greater than or equal to one. The same
position cannot be specified more than once, and every
position from one to the highest given position must be
accounted for.

If a symbol is given more than one branch table entry by
associating a range of positions with the symbol or by 
specifying the same symbol on more than one branch 
table specification line, then the symbol is defined to have 
the address of the highest associated branch table entry. 
All other branch table entries for the symbol can be 
thought of as empty slots and can be replaced by new 
entries in future versions of the shared library. 

Finally, only functions should be given branch table 
entries, and those functions must be external. 

This directive must be specified exactly once per shared
library specification file.
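
The position rules for the #branch directive can be checked mechanically. The sketch below is not part of mkshlib; it only encodes the rules stated above (positions start at one, no position is used twice, no gaps up to the highest position, and a symbol with several entries takes the highest one). The structure and function names are hypothetical, and treating a range whose end is lower than its start as invalid is an assumption.

```c
#include <stdlib.h>
#include <string.h>

struct branch_spec {
    const char *funcname;
    int first;   /* first position of the range */
    int last;    /* last position; equal to first for a single integer */
};

/* Returns 1 if the specification lines obey the position rules, else 0. */
int branch_positions_valid(const struct branch_spec *specs, int nspecs)
{
    int highest = 0, i, p, ok = 1;
    unsigned char *seen;

    for (i = 0; i < nspecs; i++) {
        if (specs[i].first < 1 || specs[i].last < specs[i].first)
            return 0;
        if (specs[i].last > highest)
            highest = specs[i].last;
    }
    seen = calloc((size_t)highest + 1, 1);
    if (seen == NULL)
        return 0;
    for (i = 0; i < nspecs && ok; i++)
        for (p = specs[i].first; p <= specs[i].last; p++) {
            if (seen[p]) { ok = 0; break; }   /* position used twice */
            seen[p] = 1;
        }
    for (p = 1; p <= highest && ok; p++)
        if (!seen[p])
            ok = 0;                           /* gap in the table */
    free(seen);
    return ok;
}

/* A symbol with several entries has the address of its highest one.
   Returns 0 if the symbol has no entry at all. */
int effective_position(const struct branch_spec *specs, int nspecs,
                       const char *name)
{
    int best = 0, i;

    for (i = 0; i < nspecs; i++)
        if (strcmp(specs[i].funcname, name) == 0 && specs[i].last > best)
            best = specs[i].last;
    return best;
}
```

The lower positions of a multi-entry symbol are the empty slots that a later version of the library can hand to new functions.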

#objects

Specifies the names of the object files constituting the
target shared library. The lines following this directive are
taken to be the list of input object files in the order they
are to be loaded into the target. The list simply consists
of each filename followed by a newline character. This
list of objects will be used to build the shared library.

This directive must be specified exactly once per shared 
library specification file. 

#objects noload 

Specifies the ordered list of host shared libraries to be 
searched to resolve references to symbols not defined in 
the library being built and not imported. Resolution of a 
reference in this way requires a version of the symbol 
with an absolute address to be found in one of the listed 
libraries. It is considered an error if a non-shared version 
of a symbol is found during the search for a shared ver- 
sion of the symbol. 

Each name specified is assumed to be a pathname to a
host or an argument of the form -lX, where libX.a is the
name of a file in the default library locations. This
behavior is identical to that of ld, and the -L option can
be used on the command line to specify other directories
in which to locate these archives.

#init object Specifies that the object file, object, requires initialization 
code. The lines following this directive are taken to be 
initialization specification lines. 

Initialization specification lines have the following format: 

ptr <white space> import 

ptr is a pointer to the associated imported symbol, import, 
and must be defined in the current specified object file, 
object. The initialization code generated for each such line 
is of the form: 

ptr = &import;

All initializations for a particular object file must be given 
once, and multiple specifications of the same object file 
are not allowed. 
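
In C terms, the effect of an #init line can be pictured as follows. This is a sketch of the pattern, not the code mkshlib generates: the names libdemo_malloc_ptr, libdemo_alloc, and libdemo_init are hypothetical, and malloc stands in for any imported symbol.

```c
#include <stdlib.h>

/* Library data: a pointer through which the library reaches a symbol
   it does not define. It is initialized, as external data should be. */
void *(*libdemo_malloc_ptr)(size_t) = 0;

/* Library text: never names the imported symbol directly, only the
   pointer, so no address has to be known when the library is built. */
void *libdemo_alloc(size_t n)
{
    return (*libdemo_malloc_ptr)(n);
}

/* What the initialization amounts to once it runs in the user's
   process: ptr = &import; */
void libdemo_init(void)
{
    libdemo_malloc_ptr = &malloc;
}
```

Because the pointer lives in the library's private data region, each process can bind it to its own definition of the imported symbol.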

#hide linker [*]

This directive changes symbols that are normally external
into static symbols, local to the library being created. A
regular expression may be given [see sh(1) and egrep(1) in the
User's/System Administrator's Reference], in which case all
external symbols matching the regular expression are hidden;
the #export directive can be used to counter this
effect for specified symbols.

The optional "*" is equivalent to the directive

#hide linker
*

and causes all external symbols to be made into static
symbols.

All symbols specified in #init and #branch directives are 
assumed to be external symbols, and cannot be changed 
into static symbols using the #hide directive. 

#export linker [*]

Specifies those symbols that a regular expression in a
#hide directive would normally cause to be hidden but
that should nevertheless remain external. For example,

#hide linker *
#export linker
one
two

causes all symbols except one, two, and those used in
#branch and #init entries to be tagged as static.



#ident string

Specifies a string, string, to be included in the .comment
section of the target shared library and the .comment
sections of every member of the host shared library.

##

Specifies a comment. The rest of the line is ignored.



Step 6: Using mkshlib to Build the Host and Target

The UNIX System command mkshlib(1) builds both the host and target
libraries. mkshlib invokes other tools such as the assembler, as(1), and link
editor, ld(1). Tools are invoked through the use of execvp [see exec(2)],
which searches directories in a user's $PATH environment variable. Also,
prefixes to mkshlib are parsed in much the same manner as prefixes to the
cc(1) command and invoked tools are given the prefix, where appropriate. For
example, 3bmkshlib invokes 3bld. These commands all are documented in
the Programmer's Reference Manual.

The user input to mkshlib consists of the library specification file and
command line options. The shared library build tool has the following syntax:

mkshlib -s specfil -t target [-h host] [-L dir...] [-n] [-q]

-s specfil  Specifies the shared library specification file, specfil. This file
contains all the information necessary to build a shared
library.

-t target  Specifies the name, target, of the target shared library
produced on the host machine. When target is moved to the
target machine, it should be installed at the location given in the
specification file (see the #target directive in the section
"Writing the Library Specification File"). If the -n option is
given, then a new target shared library will not be generated.

-h host  Specifies the name of the host shared library, host. If this
option is not given, then the host shared library will not be
produced.

-n  Prevents a new target shared library from being generated.
This option is useful when producing only a new host shared
library. The -t option must still be supplied since a version of
the target shared library is needed to build the host shared
library.

-L dir  Changes the algorithm of searching for the host shared
libraries specified with the #objects noload directive to cause
the directories in dir to be searched before the default
directories. The -L option can be specified multiple times on the
command line, in which case the directories given with the -L
options are searched in the order given on the command line,
before the default directories.

-q Suppresses the printing of warning messages. 



Guidelines for Writing Shared Library Code

Because the main advantage of a shared library over an archive library is 
sharing and the space it saves, these guidelines stress ways to increase sharing 
while avoiding the disadvantages of a shared library. The guidelines also 
stress upward compatibility. When appropriate, we describe our experience 
with building the shared C library to illustrate the ways that following a par- 
ticular guideline helped us. 

We recommend that you read these guidelines once from beginning to 
end to get a perspective of the things you need to consider when building a 
shared library, then use them as a checklist to guide your planning and
decision-making.

Before we consider these guidelines, let's consider the restrictions to build- 
ing a shared library common to all the guidelines. These restrictions involve 
static linking. Here's a summary of them, some of which are discussed in 
more detail later. Keep them in mind when reading the guidelines in this sec- 
tion. 

■ External symbols have fixed addresses.

If an external symbol moves, you have to re-link all a.out files that use
the library. This restriction applies both to text and data symbols.

Use of the #hide directive to limit externally visible symbols can help 
avoid problems in this area. (See "Use #hide and #export to Limit 
Externally Visible Symbols" in the "Using the Specification File for 
Compatibility" section for more details). 

■ If the library's text changes for one process at run time, it changes for 
all processes. 

■ If the library uses a symbol directly, that symbol's run time value 
(address) must be known when the library is built. 

■ Imported symbols cannot be referenced directly. 

Their addresses are not known when you build the library, and they 
can be different for different processes. You can use imported symbols 
by adding an indirection through a pointer in the library's data. 



Choosing Library Members

Include Large, Frequently Used Routines 

Large, frequently used routines are prime candidates for sharing. Placing 
them in a shared library saves code space for individual a.out files and saves 
memory, too, when several concurrent processes need the same code. 
printf(3S) and related C library routines (which are documented in the 
Programmer's Reference Manual) are good examples. 



When we built the shared C library 

The printf(3S) family of routines is used frequently. Con- 
sequently, we included printf(3S) and related routines in 
the shared C library. 



Exclude Infrequently Used Routines 

Putting infrequently used routines in a shared library can degrade
performance, particularly on paging systems. Traditional a.out files contain all code
they need at run time. By definition, the code in an a.out file is (at least dis- 
tantly) related to the process. Therefore, if a process calls a function, it may 
already be in memory because of its proximity to other text in the process. 

If the function is in the shared library, a page fault may be more likely to 
occur, because the surrounding library code may be unrelated to the calling 
process. Only rarely will any single a.out file use everything in the shared C 
library. If a shared library has unrelated functions, and unrelated processes 
make random calls to those functions, the locality of reference may be 
decreased. The decreased locality may cause more paging activity and, 
thereby, decrease performance. See also "Organize to Improve Locality" in
the "Tuning the Shared Library Code" section.

Exclude Routines that Use Much Static Data

Routines that use much static data increase the size of processes. As
"How Shared Libraries are Implemented" and "Deciding Whether to Use a
Shared Library" have explained, every process that uses a shared library gets
its own private copy of the library's data, regardless of how much of the data 
is needed. Library data is static: it is not shared and cannot be loaded selec- 
tively with the provision that unreferenced pages may be removed from the 
working set. 

For example, getgrent(3C), which is documented in the Programmer's
Reference Manual, is not used by many standard UNIX System commands. 
Some versions of the module define over 1400 bytes of unshared, static data. 
It probably should not be included in a shared library. You can import global 
data, if necessary, but not local, static data. 

Exclude Routines that Complicate Maintenance

All external symbols must remain at constant addresses. The branch table 
makes this easy for text symbols, but data symbols don't have an equivalent 
mechanism. The more data a library has, the more likely some of them will 
have to change size. Any change in the size of external data may affect sym- 
bol addresses and break compatibility. 

Include Routines the Library Itself Needs

It usually pays to make the library self-contained. For example, printf(3S)
requires much of the standard I/O library. A shared library containing
printf(3S) should contain the rest of the standard I/O routines, too. 



NOTE

This guideline should not take priority over the others in this section. If
you exclude some routine that the library itself needs based on a previous
guideline, consider leaving the symbol out of the library and importing it.



Changing Existing Code for the Shared Library

All C code that works in a shared library will also work in an archive 
library. However, the reverse is not true because a shared library must expli- 
citly handle imported symbols. The following guidelines are meant to help 
you produce shared library code that is still valid for archive libraries 
(although it may be slightly bigger and slower). The guidelines explain how 
to structure data for ease of maintenance, since most compatibility problems 
involve restructuring data. 

Minimize Global Data 

All external data symbols are, of course, visible to applications. This can 
make maintenance difficult. You should try to reduce global data, as 
described below. 

First, try to use automatic (stack) variables. Don't use permanent storage 
if automatic variables work. Using automatic variables saves static data space 
and reduces the number of symbols visible to application processes. 

Second, see whether variables really must be external. Static symbols are 
not visible outside the library, so they may change addresses between library 
versions. Only external variables must remain at constant addresses. See "Use #hide and
#export to Limit Externally Visible Symbols" in the section "Using the
Specification File for Compatibility" later in this chapter for further tips.

Third, allocate buffers at run time instead of defining them at compile 
time. This does two important things. It reduces the size of the library's data 
region for all processes and, therefore, saves memory; only the processes that 
actually need the buffers get them. It also allows the size of the buffer to 
change from one release to the next without affecting compatibility. Statically 
allocated buffers cannot change size without affecting the addresses of other 
symbols and, perhaps, breaking compatibility. 
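
A minimal sketch of the third point follows. The names libdemo_buf, libdemo_get_buffer, and LIBDEMO_BUFSIZE are hypothetical, and for brevity the code calls malloc directly; inside a real shared library the allocator would normally be reached as an imported symbol.

```c
#include <stdlib.h>

#define LIBDEMO_BUFSIZE 1024   /* free to change between releases */

/* Only this pointer occupies space in the library's data region, so
   changing the buffer size never moves any other data symbol. */
static char *libdemo_buf = 0;

/* Return the work buffer, allocating it on first use. Processes that
   never call this function never pay for the buffer. Returns a null
   pointer if the allocation fails. */
char *libdemo_get_buffer(void)
{
    if (libdemo_buf == 0)
        libdemo_buf = malloc(LIBDEMO_BUFSIZE);
    return libdemo_buf;
}
```

Compare this with a definition such as a static array of LIBDEMO_BUFSIZE bytes, which every process would receive whether or not it used the buffer.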

Define Text and Global Data in Separate Source Files

Separating text from global data makes it easier to prevent data symbols 
from moving. If new external variables are needed, they can be added at the 
end of the old definitions to preserve the old symbols' addresses. 

Archive libraries let the link editor extract individual members. This 
sometimes encourages programmers to define related variables and text in the 
same source file. This works fine for relocatable files, but shared libraries 
have a different set of restrictions. Suppose external variables were scattered 
throughout the library modules. Then external and static data would be inter- 
mixed. Changing static data, such as a string, like hello in the following 
example, moves subsequent data symbols, even the external symbols: 



Before                    Broken Successor

int head = 0;             int head = 0;

func()                    func()
{                         {
    p = "hello";              p = "hello, world";
}                         }

int tail = 0;             int tail = 0;



Assume the relative virtual address of head is 0 for both examples. The
string literals will have the same address too, but they have different lengths. 
The old and new addresses of tail thus might be 12 and 20, respectively. If 
tail is supposed to be visible outside the library, the two versions will not be 
compatible. 
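
The addresses quoted for tail can be reproduced with a little arithmetic. The sketch below assumes 4-byte integers aligned on 4-byte boundaries, head at relative address 0, and the string literal placed immediately after head; a real compilation system may lay data out differently, and the function names are hypothetical.

```c
#include <string.h>

/* Round offset up to the next multiple of align (a power of two). */
static unsigned long align_up(unsigned long offset, unsigned long align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Offset of tail when the data is laid out as: int head, the string
   literal, int tail. */
unsigned long tail_offset(const char *literal)
{
    unsigned long off = 0;

    off += 4;                      /* head occupies bytes 0 through 3 */
    off += strlen(literal) + 1;    /* literal plus its terminating NUL */
    return align_up(off, 4);       /* tail must start on an int boundary */
}
```

Under these assumptions "hello" puts tail at 12 and "hello, world" puts it at 20, so lengthening a hidden string is enough to move an external symbol.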



NOTE

The compilation system sometimes defines and uses static data invisibly
to the user (e.g., tables for switch statements). Therefore, it is a mistake
to assume that because you declare no static data in your shared library
you can ignore the guideline in this section.



Adding new external variables to a shared library may change the 
addresses of static symbols, but this doesn't affect compatibility. An a.out file 
has no way to reference static library symbols directly, so it cannot depend on 
their values. Thus it pays to group all external data symbols and place them 
at lower addresses than the static (hidden) data. You can write the
specification file to control this. In the list of object files, list the global
data files first.




#objects
data1.o
...
lastdata.o
text1.o
text2.o
...



If the data modules are not first, a seemingly harmless change (such as a 
new string literal) can break existing a.out files. 

Shared library users get all library data at run time, regardless of the 
source file organization. Consequently, you can put all external variables' 
definitions in a single source file without a space penalty. 

Initialize Global Data 

Initialize external variables, including the pointers for imported symbols. 
Although this uses more disk space in the target shared library, the expansion 
is limited to a single file. mkshlib will give a fatal error if it finds an
uninitialized external symbol.



Using the Specification File for Compatibility

The way in which you use the directives in the specification file can affect 
compatibility across versions of a shared library. This section gives some 
guidelines on how to use the directives #branch, #hide, and #export.

Preserve Branch Table Order 

You should add new functions only at the end of the branch table. After 
you have a specification file for the library, try to maintain compatibility with 
previous versions. You may add new functions without breaking old a.out 
files as long as previous assignments are not changed. This lets you distribute 
a new library without having to re-link all of the a.out files that used a previ- 
ous version of the library. 

Use #hide and #export to Limit Externally Visible Symbols 

Sometimes variables (or functions) must be referenced from several object 
files to be included in the shared library and yet are not intended to be avail- 
able to users of the shared library. That is, they must be external so that the 
link editor can properly resolve all references to symbols and create the target 
shared library, but should be hidden from the user's view to prevent their use. 
Such unintended and unwanted use can result in compatibility problems if the 
symbols move or are removed between versions of the shared library. 

The #hide and #export directives are the key to resolving this dilemma. 
The #hide directive causes mkshlib, after resolving all references within the 
shared library, to alter the symbol tables of the shared library so that all
specified external symbols are made static and inaccessible from user code. You
can specify the symbols to be so treated individually, through regular
expressions, or both.

The #export directive allows you to specify those symbols in the range of 
an accompanying #hide directive regular expression which should remain 
external. It is simply a convenience. 



NOTE 



It is a fatal error to try to explicitly name the same symbol in a #hide and 
an #export directive. For example, the following would result in a fatal 
error. 



#hide linker
one

#export linker
one

#export may seem like an unnecessary feature since you could avoid
specifying in the #hide directive those symbols that you do not want to be 
made static. However, its usefulness becomes apparent when the shared 
library to be built is complicated, and there are many symbols to be made 
static. In these cases, it is more efficient to use regular expressions to make all 
external variables static and individually list those symbols you need to be 
external. The simple example in the section "Writing the Library Specification 
File" demonstrates this point. 
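
The way a #hide regular expression and an #export list combine can be modeled in a few lines of C. The sketch below is an illustration only and does not describe how mkshlib is implemented. It assumes POSIX <regex.h> is available; the function name stays_external and the patterns and symbol names passed to it are hypothetical.

```c
#include <regex.h>
#include <string.h>

/*
 * Illustration only: decide whether a symbol stays external when one
 * #hide regular expression is combined with a list of #export names.
 * Returns 1 if the symbol remains external, 0 if it would be made
 * static, and -1 if the pattern cannot be compiled.
 */
int stays_external(const char *sym, const char *hide_pattern,
                   const char *const exported[], int n_exported)
{
    regex_t re;
    int i, hidden;

    /* An explicitly exported symbol always remains external. */
    for (i = 0; i < n_exported; i++)
        if (strcmp(sym, exported[i]) == 0)
            return 1;

    if (regcomp(&re, hide_pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    hidden = (regexec(&re, sym, 0, NULL, 0) == 0);
    regfree(&re);

    /* Symbols the pattern does not cover are left alone. */
    return !hidden;
}
```

With a hide pattern that covers a whole family of names and a short export list, only the listed names and the names outside the pattern remain visible, which is the arrangement the text recommends for complicated libraries.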



NOTE 



Symbols mentioned in the #branch and #init directives are services of 
the shared library, must be external symbols, and cannot be made static 
through the use of these directives. 



When we built the shared C library 

Our approach for the shared C library was to hide all data 
symbols by default, and then explicitly export symbols 
that we knew were needed. The advantage of this 
approach is that future changes to the libraries won't 
introduce new external symbols (possibly causing name 
collisions), unless we explicitly export the new symbols. 

We chose the symbols to export by looking at a list of all
the current external symbols in the shared C library and
finding out what each symbol was used for. The symbols
that were global but were only used in the shared C
library were not exported; these symbols will be hidden
from applications code. All other symbols were explicitly
exported.



Importing Symbols 

Normally, shared library code cannot directly use symbols defined outside 
a library, but an escape hatch exists. You can define pointers in the data area 
and arrange for those pointers to be initialized to the addresses of imported 
symbols. Library code then accesses imported symbols indirectly, delaying 
symbol binding until run time. Libraries can import both text and data 
symbols. Moreover, imported symbols can come from the user's code,
another library, or even the library itself. In Figure 8-4, the symbols
_libc_ptr1 and _libc_ptr2 are imported from user's code and the symbol
_libc_malloc from the library itself.




The following guidelines describe when and how to use imported sym- 
bols. 

Imported Symbols that the Library Does Not Define 

Archive libraries typically contain relocatable files, which allow undefined 
references. Although the host shared library is an archive, too, that archive is 
constructed to mirror the target library, which more closely resembles an a.out 
file. Neither target shared libraries nor a.out files can have unresolved refer- 
ences to symbols. 

Consequently, shared libraries must import any symbols they use but do 
not define. Some shared libraries will derive from existing archive libraries. 
For the reasons stated above, it may not be appropriate to include all the 
archive's modules in the target shared library. Remember though that if you 
exclude a symbol from the target shared library that is referenced from the 
target shared library, you will have to import the excluded symbol.






Imported Symbols that Users Must Be Able to Redefine 

Optionally, shared libraries can import their own symbols. At first this
might appear to be an unnecessary complication, but consider the following.
Two standard libraries, libc and libmalloc, provide a malloc family. Even
though most UNIX System commands use the malloc from the C library, they
can choose either library or define their own.



When we built the shared C library 

Three possible strategies existed for the shared C library. 
First, we could have excluded the malloc(3C) family. 
Other library members would have needed it, and so it 
would have been an imported symbol. This would have 
worked, but it would have meant less savings. 

Second, we could have included the malloc family and
not imported it. This would have given us more savings
for typical commands, but it had a price. Other library
routines call malloc directly, and those calls could not
have been overridden. If an application tried to redefine
malloc, the library calls would not have used the alternate
version. Furthermore, the link editor would have found
multiple definitions of malloc while building the
application. To resolve this the library developer would have to
change source code to remove the custom malloc, or the
developer would have to refrain from using the shared
library.

Finally, we could have included malloc in the shared
library, treating it as an imported symbol. This is what we
did. Even though malloc is in the library, nothing else
there refers to it directly; all references are through an
imported symbol pointer. If the application does not
redefine malloc, both application and library calls are routed
to the library version. All calls are mapped to the
alternate, if present.



You might want to permit redefinition of all library symbols in some libraries. 
You can do this by importing all symbols the library defines, in addition to 
those it uses but does not define. Although this adds a little space and time 
overhead to the library, the technique allows a shared library to be one 
hundred percent compatible with an existing archive at link time and run
time. 
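
The effect of routing every library call through an imported symbol pointer can be seen in a small, self-contained sketch. It is not taken from the shared C library. The names lib_alloc_ptr, lib_strdup, app_alloc, and app_alloc_calls are hypothetical, and for brevity the pointer is given its default value by a static initializer, whereas a real shared library would rely on generated initialization code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical imported symbol pointer. A real shared library would
 * have it set by initialization code, not by a static initializer. */
void *(*lib_alloc_ptr)(size_t) = malloc;

/* Library routine: allocates only through the pointer, never directly. */
char *lib_strdup(const char *s)
{
    size_t n = strlen(s) + 1;
    char *p = (char *)(*lib_alloc_ptr)(n);

    if (p != NULL)
        memcpy(p, s, n);
    return p;
}

/* An application's replacement allocator that counts its calls. */
int app_alloc_calls = 0;

void *app_alloc(size_t n)
{
    app_alloc_calls++;
    return malloc(n);
}
```

If the application leaves lib_alloc_ptr alone, lib_strdup uses the default allocator. Once the pointer refers to app_alloc, calls made inside the library reach the application's version as well, which is the behavior the third strategy above was chosen to preserve.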

Mechanics of Importing Symbols 

Let's assume a shared library wants to import the symbol malloc. The 
original archive code and the shared library code appear below. 



Archive Code

        extern char *malloc();

        export()
        {
                p = malloc(n);
        }

Shared Library Code

        /* See pointers.c below */

        extern char *(*_libc_malloc)();

        export()
        {
                p = (*_libc_malloc)(n);
        }

Making this transformation is straightforward, but two sets of source code 
would be necessary to support both an archive and a shared library. Some 
simple macro definitions can hide the transformations and allow source code 
compatibility. A header file defines the macros, and a different version of this 
header file would exist for each type of library. The -I flag to cc(1),
documented in the Programmer's Reference Manual, would direct the C preprocessor
to look in the appropriate directory to find the desired file.

Archive import.h

Shared import.h

        /*
         * Macros for importing
         * symbols. One #define
         * per symbol.
         */

        #define malloc (*_libc_malloc)

        extern char *malloc();


These header files allow a single source to serve both the original archive
library and the shared library, because they supply the indirections
for imported symbols. The declaration of malloc in the shared import.h actually
declares the pointer _libc_malloc.




Common Source

        #include "import.h"

        extern char *malloc();

        export()
        {
                p = malloc(n);
        }

Alternatively, one can hide the #include with #ifdef:

Common Source

        #ifdef SHLIB
        #  include "import.h"
        #endif

        extern char *malloc();

        export()
        {
                p = malloc(n);
        }
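
The macro technique can be tried in a single file. The following sketch folds a stand-in for import.h, a stand-in for pointers.c, and the common source into one translation unit. All names in it (lib_alloc, lib_alloc_ptr, make_buffer) are hypothetical, and SHLIB is defined in the file itself only so that the example is complete; in practice it would be set on the cc command line.

```c
#include <stdlib.h>

#define SHLIB 1     /* remove this line to get the "archive" build */

#ifdef SHLIB
/* Stand-in for the shared version of import.h. */
#define lib_alloc (*lib_alloc_ptr)
#endif

/* With SHLIB defined this declares the pointer lib_alloc_ptr;
 * without it, it declares an ordinary function named lib_alloc. */
extern void *lib_alloc(size_t);

#ifdef SHLIB
/* Stand-in for pointers.c: the pointer itself, initially zero. */
void *(*lib_alloc_ptr)(size_t) = 0;
#else
/* Stand-in for the archive library's own definition. */
void *lib_alloc(size_t n)
{
    return malloc(n);
}
#endif

/* Common source: the same text serves both kinds of library. */
void *make_buffer(size_t n)
{
    return lib_alloc(n);
}
```

In the SHLIB build, make_buffer must not be called until lib_alloc_ptr has been given a value, for example by assigning malloc to it. Supplying that value automatically is the job of the initialization code described next.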



Of course the transformation is not complete. You must define the
pointer _libc_malloc.

File pointers.c

        char *(*_libc_malloc)() = 0;

Note that _libc_malloc is initialized to zero, because it is an external data
symbol.

Special initialization code sets the pointers. Shared library code should
not use the pointer before it contains the correct value. In the example the
address of malloc must be assigned to _libc_malloc. Tools that build the
shared library generate the initialization code according to the library
specification file.

Pointer Initialization Fragments

A host shared library archive member can define one or many imported
symbol pointers. Regardless of the number, every imported symbol pointer
should have initialization code.

This code goes into the a.out file and does two things. First, it creates an
unresolved reference to make sure the symbol being imported gets resolved. 
Second, initialization fragments set the imported symbol pointers to their 
values before the process reaches main. If the imported symbol pointer can 
be used at run time, the imported symbol will be present, and the imported 
symbol pointer will be set properly. 



NOTE 



Initialization fragments reside in the host, not the target, shared library. 
The link editor copies initialization code into a.out files to set imported 
pointers to their correct values. 



Library specification files describe how to initialize the imported symbol
pointers. For example, the following specification line would set
_libc_malloc to the address of malloc:

        #init pmalloc.o
                _libc_malloc    malloc

When mkshlib builds the host library, it modifies the file pmalloc.o,
adding relocatable code to perform the following assignment statement:

        _libc_malloc = &malloc;

When the link editor extracts pmalloc.o from the host library, the relocatable
code goes into the a.out file. As the link editor builds the final a.out file,
it resolves the unresolved references and collects all initialization fragments.
When the a.out file is executed, the run time startup routines execute the
initialization fragments to set the library pointers.
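
The sequence just described (fragments collected at link time, then executed by the startup routines before main) can be imitated in portable C. This is a sketch of the idea only; the real fragments are relocatable code generated by mkshlib, not C functions. The names lib_alloc_ptr, lib_free_ptr, init_alloc, init_free, and run_init_fragments are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Imported symbol pointers, zero until their fragments have run. */
void *(*lib_alloc_ptr)(size_t) = 0;
void (*lib_free_ptr)(void *) = 0;

/* Each fragment performs one assignment of the form
 * "pointer = &symbol". */
static void init_alloc(void) { lib_alloc_ptr = malloc; }
static void init_free(void)  { lib_free_ptr = free; }

typedef void (*init_fragment)(void);

/* The fragments collected for one program. */
static const init_fragment fragments[] = { init_alloc, init_free };

/* What the startup routines would do before calling main. */
void run_init_fragments(void)
{
    size_t i;

    for (i = 0; i < sizeof fragments / sizeof fragments[0]; i++)
        (*fragments[i])();
}
```

Library code that dereferences lib_alloc_ptr before run_init_fragments has been called would jump through a null pointer, which is why the text insists that the pointers be set before the process reaches main.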

Selectively Loading Imported Symbols 

Defining fewer pointers in each archive member increases the granularity 
of symbol selection and can prevent unnecessary objects and initialization 
code from being linked into the a.out file. For example, if an archive member
defines three pointers to imported symbols, the link editor will require defini- 
tions for all three symbols, even though only one might be needed. 

You can reduce unnecessary loading by writing C source files that define 
imported symbol pointers singly or in related groups. If an imported symbol 
must be individually selectable, put its pointer in its own source file (and 
archive member). This will give the link editor a finer granularity to use 
when it resolves the reference to the symbol. 

Let's look at an example. In the coarse method, a single source file might
define all pointers to imported symbols:

Old pointers.c

        int (*_libc_ptr1)() = 0;
        char *(*_libc_malloc)() = 0;
        int (*_libc_ptr2)() = 0;

Allowing the loader to resolve only those references that are needed
requires multiple source files and archive members. Each of the new files
defines a single pointer:

        File            Contents

        ptr1.c          int (*_libc_ptr1)() = 0;
        pmalloc.c       char *(*_libc_malloc)() = 0;
        ptr2.c          int (*_libc_ptr2)() = 0;




Using the three files ensures that the link editor will only look for definitions 
for imported symbols and load in the corresponding initialization code in 
cases where the symbols are actually used. 

Referencing Symbols in a Shared Library from Another 
Shared Library 

At the beginning of the section "Importing Symbols," there was a statement
that "normally, shared libraries cannot directly use symbols defined
outside the shared library." This is true in general, and you should import all
symbols defined outside the shared library whenever possible.

Unfortunately, this is not always possible, as for example when floating-point
operations are performed in a shared library to be built. When such
operations are encountered in any C code, the standard C compiler generates 
calls to functions to perform the actual operations. These functions are 
defined in the C library and are normally resolved in a manner invisible to the 
user when an a.out is created, since the cc command automatically causes the 
relocatable version of the C library to be searched. These floating-point rou- 
tine references must be resolved at the time the shared library is being built. 
But, the symbols cannot be imported, because their names and usage are 
invisible. 

The #objects noload directive has been provided to allow symbol refer- 
ences such as these to be resolved at the time the shared library is built, pro- 
vided that the symbols are defined in another shared library. If there are 
unresolved references to symbols after the object files listed with the #objects 
directive have been link edited, the host shared libraries specified with the 
#objects noload directive are searched for absolute definitions of the symbols. 
The normal use of the directive would be to search the shared version of the 
C library to resolve references to floating-point routines. 

For this use, the syntax in the specification file would be

        #objects noload
                -lc_s

This would cause mkshlib to search for the host shared library libc_s.a in the
default library locations and to use it to resolve references to any symbols left
unresolved in the shared library being built. The -L option can be used to
cause mkshlib to look for the specified library in other than the default
locations.

A few notes on usage are in order. When building a shared library using
#objects noload, you must make sure that for each symbol with an
unresolved reference there is a version of the symbol with an absolute
definition in the searched host shared libraries, before any relocatable version of
that symbol. mkshlib will give a fatal error if this is not the case, because
relocatable definitions do not have absolute addresses and therefore do not
allow complete resolution of the target shared library.

When using a shared library built with references to symbols resolved
from another shared library, both libraries must be specified on the cc
command line. The dependent library must be specified on the command line
before the libraries on which it depends. (See the section "Building an a.out
File" for more details.) If you provide a shared library which references
symbols in another shared library, you should make sure that your
documentation clearly states that users must specify both libraries when building a.out
files.

Finally, as some of the text above hints, it is possible to use #objects 
noload to resolve references to any symbols not defined in a shared library, as 
long as they are defined in some other shared library. We strongly encourage 
you to import as many symbols as possible and to use #objects noload only 
when absolutely necessary. Probably you will only need to use this feature to 
resolve references to floating-point routines generated by the C compiler. 

Importing symbols has several important benefits over resolving refer- 
ences through #objects noload. First, importing symbols is more flexible in 
that it allows you to define your own version of library routines. You can 
define your own versions with archive versions of a library. Preserving this 
ability with the shared versions helps maintain compatibility. 

Importing symbols also helps prevent unexpected name space collisions. 
The link editor will complain about multiple definitions of a symbol, refer- 
ences to which are resolved through the #objects noload mechanism, if a 
user of the shared library also has an external definition of the symbol. 

Finally, #objects noload has the drawback that both the library you build 
and all the libraries on which it depends must be available on all the systems. 
Anyone who wishes to create a.out files using your shared library will need to 
use the host shared libraries. Also, the targets of all the libraries must be 
available on all systems on which the a.out files are to be run. 

Providing Archive Library Compatibility

Having compatible libraries makes it easy to substitute one for the other. 
In almost all cases, this can be done without makefile or source file changes. 
Perhaps the best way to explain this guideline is by example: 






When we built the shared C library 

We had an existing archive library to use as the base. This 
obviously gave us code for individual routines, and the 
archive library also gave us a model to use for the shared 
library itself. 

We wanted the host library archive file to be compatible 
with the relocatable archive C library. However, we did 
not want the shared library target file to include all rou- 
tines from the archive, because including them all would 
have hurt performance. 

Reaching these goals was, perhaps, easier than you might 
think. We did it by building the host library in two steps. 
First, we used the available shared library tools to create 
the host library to match exactly the target. The resulting 
archive file was not compatible with the archive C library 
at this point. Second, we added to the host library the set 
of relocatable objects residing in the archive C library that 
were missing from the host library. Although this set is 
not in the shared library target, its inclusion in the host 
library makes the relocatable and shared C libraries com- 
patible. 



Tuning the Shared Library Code 

Some suggestions for how to organize shared library code to improve per- 
formance are presented here. They apply to paging systems, such as UNIX 
System V Release 3.0. The suggestions come from the experience of building 
the shared C library. 

The archive C library contains several diverse groups of functions. Many 
processes use different combinations of these groups, making the paging 
behavior of any shared C library difficult to predict. A shared library should 
offer greater benefits for more homogeneous collections of code. For example, 
a database library probably could be organized to reduce system paging sub- 
stantially, if its static and dynamic calling dependencies were more predict- 
able. 

Profile the Code

To begin, profile the code that might go into the shared library (see the
prof(1) command in the Programmer's Reference Manual).

Choose Library Contents 

Based on profiling information, make some decisions about what to 
include in the shared library. a.out file size is a static property, and paging is 
a dynamic property. These static and dynamic characteristics may conflict, so 
you have to decide whether the performance lost is worth the disk space 
gained. See "Choosing Library Members" in this chapter for more informa- 
tion. 

Organize to Improve Locality 

When a function is in an a.out file(s), it probably resides in a page with
other code that is used more often (see "Exclude Infrequently Used Routines"
in the section "Choosing Library Members"). Try to improve locality of
reference by grouping dynamically related functions. If every call of funcA
generates calls to funcB and funcC, try to put them in the same page. cflow(1)
(documented in the Programmer's Reference Manual) generates this static
dependency information. Combine it with profiling to see what things
actually are called, as opposed to what things might be called.
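
Once name lists give you the offset and size of each function, checking a proposed layout is simple arithmetic. The helper functions below are a sketch with hypothetical names (page_of, fits_in_one_page, same_page). The page size is passed as a parameter, and any value you supply for it is your own assumption about the target machine.

```c
#include <assert.h>

/* Page number that holds a given byte offset within the text region. */
unsigned long page_of(unsigned long offset, unsigned long page_size)
{
    assert(page_size > 0);
    return offset / page_size;
}

/* 1 if a function of `size` bytes starting at `offset` lies entirely
 * within one page, 0 if it crosses a page boundary. */
int fits_in_one_page(unsigned long offset, unsigned long size,
                     unsigned long page_size)
{
    if (size == 0)
        return 1;
    return page_of(offset, page_size) ==
           page_of(offset + size - 1, page_size);
}

/* 1 if two functions start in the same page. */
int same_page(unsigned long a_offset, unsigned long b_offset,
              unsigned long page_size)
{
    return page_of(a_offset, page_size) == page_of(b_offset, page_size);
}
```

Running same_page over the pairs of functions that profiling shows are called together, and fits_in_one_page over the frequently used ones, gives a quick picture of how well an ordering of object files serves the guidelines in this section.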

Align for Paging 

The key is to arrange the shared library target's object files so that fre- 
quently used functions do not unnecessarily cross page boundaries. When 
arranging object files within the target library, be sure to keep the text and 
data files separate. You can reorder text object files without breaking compati- 
bility; the same is not true for object files that define global data. Once again, 
an example might best explain this guideline: 

When we built the shared C library

Using name lists and disassemblies of the shared library 
target file, we determined where the page boundaries fell. 

After grouping related functions, we broke them into 
page-sized chunks. Although some object files and func- 
tions are larger than a single page, most of them are 
smaller. Then we used the infrequently called functions as 
glue between the chunks. Because the glue between pages 
is referenced less frequently than the page contents, the 
probability of a page fault decreased. 

After determining the branch table, we rearranged the 
library's object files without breaking compatibility. We 
put frequently used, unrelated functions together because 
we figured they would be called randomly enough to keep 
the pages in memory. System calls went into another 
page as a group, and so on. The following example shows 
how to change the order of the library's object files: 

        Before          After

        #objects        #objects
        ...             ...
        printf.o        strcmp.o
        fopen.o         malloc.o
        malloc.o        printf.o
        strcmp.o        fopen.o

Avoid Hardware Thrashing

You get better performance by arranging the typical process to avoid
cache entry conflicts. If a heavily used library had both its text and its data
segment mapped to the same cache entry, the performance penalty would be
particularly severe. Every library instruction would bring the text segment
information into the cache. Instructions that referenced data would flush the
entry to load the data segment. Of course, the next instruction would
reference text and flush the cache entry, again.
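
Whether two addresses compete for the same cache entry depends entirely on the cache organization of the machine at hand. As a rough illustration, the sketch below assumes a simple direct-mapped cache whose entry is selected by dividing the address by the line size and taking the remainder modulo the number of entries. That model, the function names, and any line size or entry count you pass in are assumptions, not a description of any particular hardware.

```c
/* Entry an address maps to in a hypothetical direct-mapped cache. */
unsigned long cache_entry(unsigned long addr, unsigned long line_size,
                          unsigned long n_entries)
{
    return (addr / line_size) % n_entries;
}

/* 1 if a text address and a data address select the same entry. */
int cache_conflict(unsigned long text_addr, unsigned long data_addr,
                   unsigned long line_size, unsigned long n_entries)
{
    return cache_entry(text_addr, line_size, n_entries) ==
           cache_entry(data_addr, line_size, n_entries);
}
```

Under this model, two region addresses that differ by a multiple of the cache size always conflict, so offsetting heavily used data from heavily used text by at least one line is enough to separate them.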

Checking for Compatibility 

The following guidelines explain how to check for upwardly compatible 
shared libraries. Note, however, that upward compatibility may not always be 
an issue. Consider the case in which a shared library is one piece of a larger 
system and is not delivered as a separate product. In this restricted case, you 
can identify all a.out files that use a particular library. As long as you rebuild 
all the a.out files every time the library changes, the a.out files will run suc- 
cessfully, even though versions of the library are not compatible. This may 
complicate development, but it is possible. 

Checking Versions of Shared Libraries Using chkshlib(1)

Shared library developers normally want newer versions of a library to be 
compatible with previous ones. As mentioned before, a.out files will not exe- 
cute properly otherwise. 

If you use shared libraries, you might need to find out if different versions 
of a shared library are compatible, or if executable files could have been built 
with a particular host shared library or can run with a particular target shared 
library. For example, you might have a new version of a target shared library, 
and you need to know if all the executable files that ran with the older
version will run with the new one. You might need to find out if a particular
target shared library can reference symbols in another shared library. A
command, chkshlib(1) (documented in the Programmer's Reference Manual), has
been provided to allow you to do these and other comparisons.

chkshlib takes names of target shared libraries, host shared libraries, and
executable files as input, and checks to see if those files satisfy the
compatibility criteria. chkshlib checks to see if every library symbol in the first file that
needs to be matched exists in the second file and has the same address. The
following table shows what types of files and how many of them chkshlib
accepts as input.

The rows listed down represent the first input given, and the columns 
listed across represent any more inputs given. For example, if the first input 
file you give chkshlib is a target shared library, you must give another input 
file that is a target or host shared library. 

                        Nothing     Executable  Target      Host

        Executable      OK          No          OK*         OK*
        Target          No          No          OK          OK
        Host            OK          No          OK          OK

* The executable file must be one that was built using a host shared library.
  A useful way to confirm this is to use dump -L to find out which
  target file(s) gets loaded when the program is run. See dump(1),
  documented in the Programmer's Reference Manual.

* You can also have executable target1...targetn and executable host1...hostn.

An example of a chkshlib command line is shown below: 

chkshlib /shlib/libc_s /lib/libc_s.a 

In this example, /shlib/libc_s is a target shared library and /lib/libc_s.a is a
host shared library. chkshlib will check to see if executable files built with
/lib/libc_s.a would be able to run with /shlib/libc_s.

Depending on the input it receives, chkshlib checks to find out if the fol- 
lowing is true: 

■ an executable file will run with the given target shared library 

■ an executable file could have been built using the given host shared 
library 

■ an executable file produced with a given host shared library will run
with a given target shared library

■ an executable file that ran with an old version of a target shared 
library will run with a new version 

■ a new host shared library can replace the old host shared library; that 
is, executable files built with the new host shared library will run with 
the old target shared library 

■ a target shared library can reference symbols in another target shared 
library 

To determine if files are compatible, you have to determine which library
symbols in the first file need to be matched in the second file. 

■ For target shared libraries, the symbols of concern are all external, 
defined symbols with non-zero values, except for branch labels (branch 
labels always start with .bt), and the special symbols etext, edata, and 
end. 

■ For host shared libraries, the symbols of concern are all external,
absolute symbols with a non-zero value.

■ For executable files, the symbols of concern are all external, absolute 
symbols with a non-zero value, except for the special symbols etext, 
edata, and end. 

For two files to be compatible, the target pathnames must be identical in both 
files (unless the -i option has been specified). 
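
The comparison rule for target shared libraries can be expressed compactly. The sketch below is not the chkshlib source. It assumes the symbol tables have already been read into arrays, and struct sym, of_concern, and symbols_match are hypothetical names. It applies the target-library rule given above: non-zero value, not a branch label, and not one of etext, edata, or end.

```c
#include <string.h>

/* One external, defined symbol: its name and its value (address). */
struct sym {
    const char   *name;
    unsigned long value;
};

/* 1 if the symbol must be matched in the other file. */
static int of_concern(const struct sym *s)
{
    if (s->value == 0)
        return 0;
    if (strncmp(s->name, ".bt", 3) == 0)        /* branch label */
        return 0;
    if (strcmp(s->name, "etext") == 0 ||
        strcmp(s->name, "edata") == 0 ||
        strcmp(s->name, "end") == 0)
        return 0;
    return 1;
}

/* 1 if every symbol of concern in `first` also appears in `second`
 * with the same value, 0 otherwise. */
int symbols_match(const struct sym *first, int n_first,
                  const struct sym *second, int n_second)
{
    int i, j, found;

    for (i = 0; i < n_first; i++) {
        if (!of_concern(&first[i]))
            continue;
        found = 0;
        for (j = 0; j < n_second; j++) {
            if (strcmp(first[i].name, second[j].name) == 0 &&
                first[i].value == second[j].value) {
                found = 1;
                break;
            }
        }
        if (!found)
            return 0;
    }
    return 1;
}
```

Note that the check is one-directional: a newer library may add symbols freely, but it fails as soon as one symbol of concern from the older library has moved or disappeared.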

The following table displays the output you will receive when you use
chkshlib to check different combinations of files for compatibility. In this
table file1 represents the name of the first file given, and file2,3,... represents
the names of any more files given as input.



        Input                               Output

        file1 is executable                 file1 can [may not] execute using file2
        file2,3,... (if any) are targets    file1 can [may not] execute using file3

        file1 is executable                 file1 may [may not] have been produced using file2
        file2,3 are hosts                   file1 may [may not] have been produced using file3

        file1 is host                       file1 can [may not] produce executables which
        file2 (if any) is target            will run with file2

        file1 is target                     file2 can [may not] produce executables which
        file2 is host                       will run with file1

        both files are targets or           file1 can [may not] replace file2
        both files are hosts                file2 can [may not] replace file1

        both files are targets and          file1 can [may not] include file2
        -n option is specified*

* The -n option tells chkshlib that the two files are target shared libraries,
  the first of which can reference (include) symbols in the other. See
  "Referencing Symbols in a Shared Library from Another Shared Library" for details.

For more information on chkshlib, see chkshlib(1), documented in the
Programmer's Reference Manual.



When we built the shared C library 

When we built the second version of the shared C library 
and checked it against the first version, chkshlib reported 
that many external symbols had different values and, 
therefore, the second version could not replace the first. 
Here is a list of these symbols: 



        _bigpow                 _qnan1.single
        _litpow                 _qnan2.double
        _inf1.double            _qnan2.single
        _inf1.single            _round.double
        _inf2.double            _round.single
        _inf2.single            _trap.single
        _invalid.double         _type.double
        _invalid.single         _type.single
        _qnan1.double



Since these text symbols were not intended to be user 
entry points, they were not put in the branch table. So 
when new code was added to the shared library the 
addresses of these text symbols changed, and hence their 
values changed. 

We devised the #hide and #export directives to allow 
us to explicitly hide the symbols we did not want to be 
user entry points. In fact, in the latest C Shared Library we 
hid all the symbols, and exported just the ones we want to 
be user entry points. 

You cannot directly reference these functions, and 
these symbols will not be considered incompatible by 
chkshlib in checking the latest version of the shared C 
library with any subsequent version. 

Dealing with Incompatible Libraries

When you determine that a newer version of a library can't replace the 
older version, you have to deal with the incompatibility. You can deal with it 
in one of two ways. First, you can rebuild all the a.out files that use your
library. If feasible, this is probably the best choice. Unfortunately, you might 
not be able to find those a.out files, let alone force their owners to rebuild 
them with your new library. 

So your second choice is to give a different target pathname to the new 
version of the library. The host and target pathnames are independent; so 
you don't have to change the host library pathname. New a.out files will use 
your new target library, but old a.out files will continue to access the old 
library. 

As the library developer, it is your responsibility to check for compatibility 
and, probably, to provide a new target library pathname for a new version of 
a library that is incompatible with older versions. If you fail to resolve com- 
patibility problems, a.out files that use your library will not work properly. 



NOTE 



You should try to avoid multiple library versions. If too many copies of the 
same shared library exist, they might actually use more disk space and more 
memory than the equivalent relocatable version would have. 



An Example 

This section describes the process by which a small specialized shared
library is created and built. We refer to the guidelines given earlier in this
chapter.

The Original Source 

The name of the library to be built is libmaux (for math auxiliary library). 
The interface consists of three functions, an external variable, and a header 
file. 

The three functions:

        logd        floating-point logarithm to a given base; defined in the file
                    log.c

        polyd       evaluate a polynomial; defined in the file poly.c

        maux_stat   return usage counts for the other two routines in a structure;
                    defined in stats.c

The external variable:

        mauxerr     set to non-zero if there is an error in the processing of any of
                    the functions in the library and set to zero if there is no error
                    (unlike errno in the C library).

And the header file:

        maux.h      declares the return types of the functions and the structure
                    returned by maux_stat.

The source files before any modifications for inclusion in a shared library 
are given below. 

        /* log.c */
        #include "maux.h"
        #include <math.h>

        /*
         * Return the log of "x" relative to the base "a".
         *
         *      logd(base, x) := log(x) / log(base);
         *      where "log" is "log to the base E".
         */

        double logd(base, x)
        double base, x;
        {
                extern int stats_logd;
                extern int total_calls;

                double logbase;
                double logx;

                total_calls++;
                stats_logd++;

                logbase = log((double)base);
                logx = log((double)x);

                if (logbase == -HUGE || logx == -HUGE) {
                        mauxerr = 1;
                        return(0);
                }
                else
                        mauxerr = 0;
                return(logx/logbase);
        }

Figure 8-5: File log.c

        /* poly.c */
        #include "maux.h"
        #include <math.h>

        /* Evaluate the polynomial
         *      f(x) := a[0] * (x ^ n) + a[1] * (x ^ (n-1)) + ... + a[n];
         * Note that there are N+1 coefficients!
         * This uses Horner's Method, which is:
         *      f(x) := (((((a[0]*x) + a[1])*x) + a[2]) + ...) + a[n];
         * It's equivalent, but uses many fewer operations and is more precise. */

        double polyd(a, n, x)
        double a[];
        int n;
        double x;
        {
                extern int stats_polyd;
                extern int total_calls;
                double result;
                int i;

                total_calls++;
                stats_polyd++;
                if (n < 0) {
                        mauxerr = 1;
                        return(0);
                }
                result = a[0];
                for (i = 1; i <= n; i++)
                {
                        result *= (double)x;
                        result += (double)a[i];
                }
                mauxerr = 0;
                return(result);
        }

Figure 8-6: File poly.c
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/* stats. c */ 
#iiiclude "mux.h" 

int total_calls = 0; 
int stats_Iogd = 0; 
int stats_polyd = 0; 

ixxt naijxerr; 

/♦ Return structure vdth usage stats for functicais in library 

* or if space cannot be allocated far the structure */ 
struct instats * 
naux_stat( ) 
{ 

extern char * nalloc( ) ; 
struct instats ♦ st; 

if{{st = (struct instats *) nalloc(sizeof (struct mstats))) ~ 0) 

retum(O) ; 
st->st_polydl = stats_polyd; 
st->st_logd = stats_lc9d; 
st->st_total = total_calls; 
retum(st) ; 



Figure 8-7: File stats.c 
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/* naux.h */ 

stacuct mstats { 

int stjpolyd; 
int st_logd; 
int stjbotal; 

}; 

extern double pol3^( ) ; 

extern double logd( ) ; 

extern struct mstats * roavtx_stat( ) ; 

extern int mauxerr; 

V 

Figure 8-8: Header File maux.h 
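
As an aside, the Horner's Method loop used by polyd in poly.c can be checked against a term-by-term evaluation. The following is only a sketch, written in ANSI C rather than the older style of the figures; the names horner and naive_poly are hypothetical and are not part of libmaux, and the usage counters and mauxerr handling are left out.

```c
/* Evaluate a[0]*x^n + a[1]*x^(n-1) + ... + a[n] by Horner's Method,
 * mirroring the loop in polyd.  The caller must ensure n >= 0. */
double horner(const double a[], int n, double x)
{
    double result = a[0];
    int i;

    for (i = 1; i <= n; i++)
        result = result * x + a[i];   /* one multiply and one add per step */
    return result;
}

/* Evaluate the same polynomial term by term, for comparison only. */
double naive_poly(const double a[], int n, double x)
{
    double sum = 0.0;
    int i, j;

    for (i = 0; i <= n; i++) {
        double term = a[i];

        for (j = 0; j < n - i; j++)   /* multiply a[i] by x^(n-i) */
            term *= x;
        sum += term;
    }
    return sum;
}
```

For the coefficients {2, -3, 1} and x = 3, both functions evaluate 2*9 - 3*3 + 1. Horner's form needs n multiplications, while the term-by-term loop needs n(n+1)/2.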



Choosing Region Addresses and the Target Pathname 

To begin, we choose the region addresses for the library's .text and .data
sections from the segments reserved for private use on the 80386 Computer.
Note that the region addresses must be on a segment boundary (4 MB):

.text 0x80680000
.data 0x806a0000

Also we choose the pathname for our target library:

/my/directory/libmaux_s

Selecting Library Contents 

This example is for illustration purposes, and so we will include every-
thing in the shared library. In a real case, it is unlikely that you would make
a shared library with these three small routines, unless you had many pro-
grammers using them frequently.

Rewriting Existing Code

According to the guidelines given earlier in the chapter, we need to first
minimize the global data. We realize that total_calls, stats_logd, and
stats_polyd do not need to be visible outside the library, but are needed in
multiple files within the library. Hence, we will use the #hide directive in
our specification file to make these variables static after the shared library is
built.

We need to define text and global data in separate source files. The only
piece of global data we have left is mauxerr, which we will remove from
stats.c and put in a new file maux_defs.c. We will also have to initialize it to
zero, since shared libraries cannot have any uninitialized variables.

Next, we notice that there are some references to symbols that we do not
define in our shared library (i.e., log and malloc). We can import these sym-
bols. To do so, we create a new header file, import.h, which will be included
in each of log.c, poly.c, and stats.c. The header file defines C preprocessor
macros for these symbols to make transparent the use of indirection in the
actual C source files. We use the _libmaux_ prefixes on the pointers to the
symbols because those pointers are made external, and the use of the library
name as a prefix helps prevent name conflicts.

/* New header file import.h */
#define malloc (*_libmaux_malloc)
#define log    (*_libmaux_log)

extern char * malloc();
extern double log();

Now, we need to define the imported symbol pointers somewhere. We
have already created a file for global data, maux_defs.c, so we will add the
definitions to it.

/* Data file maux_defs.c */

int mauxerr = 0;

double (*_libmaux_log)() = 0;

char * (*_libmaux_malloc)() = 0;
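
The effect of the macro and the pointer together can be tried outside a shared library. The sketch below assumes an ANSI prototype for the pointer and uses a hypothetical function, fake_init, to stand in for the assignment that the #init directive arranges; new_stats is likewise hypothetical. Only malloc is shown; log would follow the same pattern. Redefining a standard library name as a macro is what import.h does, although strictly portable code would choose a different macro name.

```c
#include <stdlib.h>

/* Imported-symbol pointer, as defined in maux_defs.c.  The ANSI
 * prototype is an assumption; the original declaration has none. */
void *(*_libmaux_malloc)(size_t) = 0;

/* Hypothetical stand-in for the #init directive: store the address of
 * the real malloc in the pointer.  This function appears before the
 * macro below, so "malloc" here still names the C library routine. */
void fake_init(void)
{
    _libmaux_malloc = malloc;
}

/* As in import.h: from here on, "malloc" calls through the pointer. */
#define malloc (*_libmaux_malloc)

struct mstats {
    int st_polyd;
    int st_logd;
    int st_total;
};

/* Library-style code: the source still reads like a plain malloc call,
 * but it expands to (*_libmaux_malloc)(sizeof(struct mstats)). */
struct mstats *new_stats(int polyd_calls, int logd_calls)
{
    struct mstats *st = (struct mstats *) malloc(sizeof(struct mstats));

    if (st == 0)
        return 0;
    st->st_polyd = polyd_calls;
    st->st_logd = logd_calls;
    st->st_total = polyd_calls + logd_calls;
    return st;
}
```

A caller would run fake_init() once and then use new_stats() normally. If the pointer were still zero when new_stats() ran, the call through it would fail, which is why the pointers must be set before any library routine is used.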



Finally, we observe that there are floating-point operations in the code, 
and we remember that the routines for these cannot be imported. (If we tried 
to write the specification file and build the shared library without taking this 
into account, mkshlib would give us errors about unresolved references.) 
This means we will have to use the #objects noload directive in our specifica- 
tion file to search the C host shared library to resolve the references. 



Writing the Specification File 

This is the specification file for libmaux:

 2  ## libmaux.sl - libmaux specification file
 3  #address .text 0x80680000
 4  #address .data 0x806a0000
 5  #target /my/directory/libmaux_s
 6  #branch
 7      polyd        1
 8      logd         2
 9      maux_stat    3
10  #objects
11      maux_defs.o
12      poly.o
13      log.o
14      stats.o
15  #objects noload
16      -lc_s
17  #hide linker *
18  #export linker
19      mauxerr
20  #init maux_defs.o
21      _libmaux_malloc    malloc
22      _libmaux_log       log

Figure 8-9: Specification File



Briefly, here is what the specification file does. Lines 1 and 2 are com-
ment lines. Lines 3 and 4 give the virtual addresses for the shared library text
and data regions, respectively. Line 5 gives the pathname of the shared
library on the target machine. The target shared library must be installed
there for a.out files that use it to work correctly. Line 6 contains the #branch
directive. Lines 7 through 9 specify the branch table. They assign the func-
tions polyd(), logd(), and maux_stat() to branch table entries 1, 2, and 3.
Only external text symbols, such as C functions, should be placed in the
branch table.

Line 10 contains the #objects directive. Lines 11 through 14 give the list
of object files that will be used to construct the host and target shared
libraries. When building the host shared library archive, each file listed here
will reside in its own archive member. When building the target library, the
order of object files will be preserved. The data files must be first. Otherwise,
an addition of static data to poly.o, for example, would move external data
symbols and break compatibility.

Line 15 contains the #objects noload directive, and line 16 gives informa- 
tion about where to resolve the references to the floating-point routines. 

Lines 17 through 19 contain the #hide linker and #export linker direc- 
tives, which tell what external symbols are to be left external after the shared 
library is built. Together, these #hide and #export directives say that only 
mauxerr will remain external. The symbols in the branch table and those 
specified in the #init directive will remain external by definition. 

Line 20 contains the #init directive. Lines 21 and 22 give imported sym-
bol information for the object file maux_defs.o. You can imagine assign-
ments of the symbol values on the right to the symbols on the left. Thus
_libmaux_malloc will hold a pointer to malloc, and so on.

Building the Shared Library 

Now, we have to compile the .o files as we would for any other library:

cc -c maux_defs.c poly.c log.c stats.c

Next, we need to invoke mkshlib to build our host and target libraries:

mkshlib -s libmaux.sl -t libmaux_s -h libmaux_s.a

Presuming all of the source files have been compiled appropriately, the
mkshlib command line shown above will create both the host library,
libmaux_s.a, and the target library, libmaux_s. Before any a.out files built
with libmaux_s.a can be executed, the target shared library libmaux_s will
have to be moved to /my/directory/libmaux_s as specified in the specifica-
tion file.

Using the Shared Library 

To use the shared library with a file, x.c, which contains a reference to
one or more of the routines in libmaux, you would issue the following com-
mand line:

cc x.c libmaux_s.a -lm -lc_s

This command line causes the following:

■ the imported symbol pointer reference to log is resolved from libm

■ the imported symbol pointer reference to malloc is resolved with the
shared version from libc_s.

The most important thing to note from the command line, however, is that
you have to specify the C host shared library (in this case with the -lc_s) on
the command line, since libmaux was built with direct references to the
floating-point routines in that library.

Summary

This chapter describes the UNIX System shared libraries and explains how 
to use them. It also explains how to build your own shared libraries. Using 
any shared library almost always saves disk storage space, memory, and com- 
puter power; and running the UNIX System on smaller machines makes the 
efficient use of these resources increasingly important. Therefore, you should 
normally use a shared library whenever it's available. 






Interprocess Communication

Introduction

Messages
  Getting Message Queues
    ■ Using msgget
    ■ Example Program
  Controlling Message Queues
    ■ Using msgctl
    ■ Example Program
  Operations for Messages
    ■ Using msgop
    ■ Example Program

Semaphores
  Using Semaphores
  Getting Semaphores
    ■ Using semget
    ■ Example Program
  Controlling Semaphores
    ■ Using semctl
    ■ Example Program
  Operations on Semaphores
    ■ Using semop
    ■ Example Program

Shared Memory
  Using Shared Memory
  Getting Shared Memory Segments
    ■ Using shmget
    ■ Example Program
  Controlling Shared Memory
    ■ Using shmctl
    ■ Example Program
  Operations for Shared Memory
    ■ Using shmop
    ■ Example Program


Introduction 



The UNIX System supports three types of Inter-Process Communication 
(IPC): 

■ messages 

■ semaphores 

■ shared memory 

This chapter describes the system calls for each type of IPC. Included are 
several example programs that show the use of the IPC system calls. 

Since there are many ways in the C Programming Language to accomplish
the same task or requirement, keep in mind that the example programs were
written for clarity and not for program efficiency. Usually, system calls are
embedded within a larger, user-written program that makes use of a particular
function that the calls provide.

Messages

The message type of IPC allows processes (executing programs) to com-
municate through the exchange of data stored in buffers. This data is
transmitted between processes in discrete portions called messages. Processes
using this type of IPC can perform two operations:

■ sending

■ receiving

Before a message can be sent or received by a process, a process must have
the UNIX System generate the necessary software mechanisms to handle these
operations. A process does this by using the msgget(2) system call. While
doing this, the process becomes the owner/creator of the message facility and
specifies the initial operation permissions for all other processes, including
itself. Subsequently, the owner/creator can relinquish ownership or change
the operation permissions using the msgctl(2) system call. However, the crea-
tor remains the creator as long as the facility exists. Other processes with per-
mission can use msgctl() to perform various other control functions.

Processes which have permission and are attempting to send or receive a 
message can suspend execution if they are unsuccessful at performing their 
operation. That is, a process which is attempting to send a message can wait 
until the process which is to receive the message is ready and vice versa. A 
process which specifies that execution is to be suspended is performing a 
"blocking message operation." A process which does not allow its execution 
to be suspended is performing a "nonblocking message operation." 

A process performing a blocking message operation can be suspended 
until one of three conditions occurs: 

■ It is successful. 

■ It receives a signal. 

■ The facility is removed. 
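
The difference between the two kinds of operation can be seen with a receive on an empty queue. The sketch below assumes the IPC_NOWAIT flag of msgrcv(2), which is covered with the message operations later in the chapter; the structure demo_msg and the function nonblocking_receive_demo are hypothetical names chosen for illustration.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct demo_msg {          /* hypothetical message layout */
    long mtype;            /* message type, must be greater than zero */
    char mtext[16];        /* message text */
};

/* Try a nonblocking receive on a freshly created, empty queue.
 * Returns 1 if the call failed at once with ENOMSG, 0 on any other
 * outcome, and -1 if the queue could not be created at all. */
int nonblocking_receive_demo(void)
{
    struct demo_msg m;
    int msqid, got_enomsg;

    msqid = msgget(IPC_PRIVATE, 0600);
    if (msqid == -1)
        return -1;

    /* IPC_NOWAIT makes this a nonblocking operation; without it the
     * process would be suspended until a message arrived, a signal
     * was received, or the queue was removed. */
    got_enomsg = (msgrcv(msqid, &m, sizeof m.mtext, 0L, IPC_NOWAIT) == -1
                  && errno == ENOMSG);

    msgctl(msqid, IPC_RMID, (struct msqid_ds *) 0);   /* remove the facility */
    return got_enomsg;
}
```

Dropping IPC_NOWAIT from the msgrcv() call would turn the same request into a blocking message operation.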

System calls make these message capabilities available to processes. The 
calling process passes arguments to a system call, and the system call either 
successfully or unsuccessfully performs its function. If the system call is suc- 
cessful, it performs its function and returns applicable information. Other- 
wise, a known error code (-1) is returned to the process, and an external error 
number variable errno is set accordingly. 
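
The return convention just described can be sketched as follows. The helper name get_queue is hypothetical; the only system interfaces assumed are msgget(2), errno, and the standard strerror() routine.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Call msgget() and report failure in the conventional way: the call
 * returns -1 and leaves the reason in errno.  Returns the msqid on
 * success, or -1 on failure. */
int get_queue(key_t key, int msgflg)
{
    int msqid = msgget(key, msgflg);

    if (msqid == -1) {
        /* Save errno first; later library calls may overwrite it. */
        int saved = errno;

        fprintf(stderr, "msgget failed: errno = %d (%s)\n",
                saved, strerror(saved));
        errno = saved;
    }
    return msqid;
}
```

A caller tests the result against -1 before using it as a message queue identifier.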
Before a message can be sent or received, a uniquely identified message
queue and data structure must be created. The unique identifier created is
called the message queue identifier (msqid); it is used to identify or reference
the associated message queue and data structure.

The message queue is used to store (header) information about each mes- 
sage that is being sent or received. This information includes the following 
for each message: 

■ pointer to the next message on queue 

■ message type 

■ message text size 

■ message text address 

There is one associated data structure for the uniquely identified message 
queue. This data structure contains the following information related to the 
message queue: 

■ operation permissions data (operation permission structure) 

■ pointer to first message on the queue 

■ pointer to last message on the queue 

■ current number of bytes on the queue 

■ number of messages on the queue 

■ maximum number of bytes on the queue 

■ process identification (PID) of last message sender 

■ PID of last message receiver 

■ last message send time 

■ last message receive time 

■ last change time 



NOTE

All include files discussed in this chapter are located in the /usr/include
or /usr/include/sys directories.

The C Programming Language data structure definition for message infor-
mation contained in the message queue is located in the <sys/msg.h> header
file and is as follows:

struct msg
{
    struct msg *msg_next;    /* ptr to next message on q */
    long        msg_type;    /* message type */
    short       msg_ts;      /* message text size */
    short       msg_spot;    /* message text map address */
};



Likewise, the structure definition for the associated data structure is
located in the <sys/msg.h> header file and is as follows:

struct msqid_ds
{
    struct ipc_perm msg_perm;     /* operation permission struct */
    struct msg     *msg_first;    /* ptr to first message on q */
    struct msg     *msg_last;     /* ptr to last message on q */
    ushort          msg_cbytes;   /* current # bytes on q */
    ushort          msg_qnum;     /* # of messages on q */
    ushort          msg_qbytes;   /* max # of bytes on q */
    ushort          msg_lspid;    /* pid of last msgsnd */
    ushort          msg_lrpid;    /* pid of last msgrcv */
    time_t          msg_stime;    /* last msgsnd time */
    time_t          msg_rtime;    /* last msgrcv time */
    time_t          msg_ctime;    /* last change time */
};

Note that the msg_perm member of this structure uses ipc_perm as a tem-
plate. The breakout for the operation permissions data structure is shown in
Figure 9-1.

The definition of the ipc_perm data structure is located in the
<sys/ipc.h> header file and is as follows:

struct ipc_perm
{
    ushort  uid;     /* owner's user id */
    ushort  gid;     /* owner's group id */
    ushort  cuid;    /* creator's user id */
    ushort  cgid;    /* creator's group id */
    ushort  mode;    /* access modes */
    ushort  seq;     /* slot usage sequence number */
    key_t   key;     /* key */
};

Figure 9-1: ipc_perm Data Structure



The structure is common for all IPC facilities. 

The msgget(2) system call is used to perform two tasks when only the 
IPC_CREAT flag is set in the msgflg argument that it receives: 

■ to get a new msqid and create an associated message queue and data 
structure for it 

■ to return an existing msqid that already has an associated message 
queue and data structure 

The task performed is determined by the value of the key argument
passed to the msgget() system call. For the first task, if the key is not already
in use for an existing msqid, a new msqid is returned with an associated mes-
sage queue and data structure created for the key. This occurs provided no
system tunable parameters would be exceeded.

There is also a provision for specifying a key of value zero which is
known as the private key (IPC_PRIVATE = 0); when specified, a new msqid
is always returned with an associated message queue and data structure
created for it, unless a system tunable parameter would be exceeded. When
the ipcs command is performed, for security reasons the KEY field for the
msqid is all zeros.

For the second task, if a msqid exists for the key specified, the value of
the existing msqid is returned. If you do not desire to have an existing msqid
returned, a control command (IPC_EXCL) can be specified (set) in the msgflg
argument passed to the system call. The details of using this system call are
discussed in the "Using msgget" section of this chapter.

When performing the first task, the process which calls msgget becomes 
the owner/creator, and the associated data structure is initialized accordingly. 
Remember, ownership can be changed but the creating process always 
remains the creator; see the "Controlling Message Queues" section in this 
chapter. The creator of the message queue also determines the initial opera- 
tion permissions for it. 

Once a uniquely identified message queue and data structure are created,
message operations [msgop()] and message control [msgctl()] can be used.

Message operations, as mentioned previously, consist of sending and 
receiving messages. System calls are provided for each of these operations; 
they are msgsnd() and msgrcv(). Refer to the "Operations for Messages" sec- 
tion in this chapter for details of these system calls. 

Message control is done by using the msgctl(2) system call. It permits 
you to control the message facility in the following ways: 

■ to determine the associated data structure status for a message queue 
identifier (msqid) 

■ to change operation permissions for a message queue 

■ to change the size (msg_qbytes) of the message queue for a particular
msqid

■ to remove a particular msqid from the UNIX Operating System along
with its associated message queue and data structure

Refer to the "Controlling Message Queues" section in this chapter for
details of the msgctl() system call.

Getting Message Queues 

This section gives a detailed description of using the msgget(2) system call 
along with an example program illustrating its use. 

Using msgget 

The synopsis found in the msgget(2) entry in the Programmer's Reference
Manual is as follows:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int msgget (key, msgflg)
key_t key;
int msgflg;

All of these include files are located in the /usr/include/sys directory of
the UNIX Operating System.

The following line in the synopsis informs you that msgget() is a function 
with two formal arguments that returns an integer type value upon successful 
completion (msqid). 

int msgget (key, msgflg) 

The next two lines declare the types of the formal arguments. key_t is
declared by a typedef in the types.h header file to be an integer.

key_t key;
int msgflg;

The integer returned from this function upon successful completion is the
message queue identifier (msqid) that was discussed earlier.

As declared, the process calling the msgget() system call must supply two 
arguments to be passed to the formal key and msgflg arguments. 

A new msqid with an associated message queue and data structure is pro- 
vided if one of the following conditions exists: 

■ key is equal to IPC_PRIVATE

■ key is passed a unique hexadecimal integer, and msgflg ANDed with 
IPC_CREAT is TRUE. 

The value passed to the msgflg argument must be an integer type octal 
value and will specify the following: 

■ access permissions 

■ execution modes 

■ control fields (commands) 

Access permissions determine the read/write attributes, and execution
modes determine the user/group/other attributes of the msgflg argument.
They are collectively referred to as "operation permissions." Figure 9-2
reflects the numeric values (expressed in octal notation) for the valid operation
permissions codes.

Operation Permissions    Octal Value

Read by User             00400
Write by User            00200
Read by Group            00040
Write by Group           00020
Read by Others           00004
Write by Others          00002

Figure 9-2: Operation Permissions Codes

A specific octal value is derived by adding the octal values for the opera-
tion permissions desired. That is, if read by user and read/write by others is
desired, the code value would be 00406 (00400 plus 00006). There are con-
stants located in the msg.h header file which can be used for the user
(OWNER).
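
The arithmetic can be written out in C. In this sketch the constant names are hypothetical (they are not taken from msg.h or any other system header); only the octal values come from Figure 9-2.

```c
/* Operation permission values from Figure 9-2.  The names are
 * illustrative only and do not come from a system header. */
#define PERM_READ_USER     00400
#define PERM_WRITE_USER    00200
#define PERM_READ_GROUP    00040
#define PERM_WRITE_GROUP   00020
#define PERM_READ_OTHERS   00004
#define PERM_WRITE_OTHERS  00002

/* Read by user plus read/write by others: 00400 + 00004 + 00002. */
int user_read_others_rw(void)
{
    return PERM_READ_USER + PERM_READ_OTHERS + PERM_WRITE_OTHERS;
}
```

Because each permission occupies its own bit, adding the values and bitwise ORing them give the same result, provided no value is included twice.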

Control commands are predefined constants (represented by all uppercase
letters). Figure 9-3 contains the names of the constants which apply to the
msgget() system call along with their values. They are also referred to as
flags and are defined in the ipc.h header file.

Control Command    Value

IPC_CREAT          0001000
IPC_EXCL           0002000

Figure 9-3: Control Commands (Flags)



The value for the msgflg argument is, therefore, a combination of opera-
tion permissions and control commands. After determining the value for the
operation permissions as previously described, the desired flag(s) can be speci-
fied. This specification is accomplished by bitwise ORing (|) them with the
operation permissions; bit positions and values for the control commands in
relation to those of the operation permissions make this possible. An example
of determining the msgflg argument follows.

                         Octal Value    Binary Value

IPC_CREAT              = 01000          000 001 000 000 000
| ORed by Read by User = 00400          000 000 100 000 000

msgflg                 = 01400          000 001 100 000 000

The msgflg value can easily be set by using the names of the flags in con-
junction with the octal operation permissions value:

msqid = msgget (key, (IPC_CREAT | 0400));

msqid = msgget (key, (IPC_CREAT | IPC_EXCL | 0400));
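
The same combination can be packaged in a small routine. This is a sketch only; make_msgflg is a hypothetical helper, and masking the permissions with 0777 is a choice made here to keep stray bits out of the flag positions.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Build a msgflg value from operation permissions plus optional
 * control commands.  A nonzero want_creat or want_excl sets the
 * corresponding flag. */
int make_msgflg(int opperm, int want_creat, int want_excl)
{
    int msgflg = opperm & 0777;     /* keep only the permission bits */

    if (want_creat)
        msgflg |= IPC_CREAT;
    if (want_excl)
        msgflg |= IPC_EXCL;
    return msgflg;
}
```

A call such as msgget(key, make_msgflg(0400, 1, 0)) then passes the same value as (IPC_CREAT | 0400).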
As specified by the msgget(2) page in the Programmer's Reference Manual,
success or failure of this system call depends upon the argument values for
key and msgflg or system tunable parameters. The system call will attempt
to return a new msqid if one of the following conditions exists:

■ key is equal to IPC_PRIVATE (0)

■ key does not already have a msqid associated with it, and (msgflg &
IPC_CREAT) is TRUE (not zero).

The key argument can be set to IPC_PRIVATE in the following ways:

msqid = msgget (IPC_PRIVATE, msgflg);

or

msqid = msgget (0, msgflg);

This alone will cause the system call to be attempted because it satisfies the 
first condition specified. Exceeding the MSGMNI system tunable parameter 
always causes a failure. The MSGMNI system tunable parameter determines 
the maximum number of unique message queues (msqid's) in the UNIX 
Operating System. 

The second condition is satisfied if the value for key is not already associ-
ated with a msqid and the bitwise ANDing of msgflg and IPC_CREAT is
TRUE (not zero). This means that the key is unique (not in use) within the
UNIX Operating System for this facility type and that the IPC_CREAT flag is
set (msgflg | IPC_CREAT). The bitwise ANDing (&), which is the logical way
of testing if a flag is set, is illustrated as follows:

msgflg      = x1xxx  (x = immaterial)
& IPC_CREAT = 01000

result      = 01000  (not zero)

Since the result is not zero, the flag is set or TRUE.
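
In C the test is a single expression. The function name flag_is_set below is hypothetical and is used only for illustration.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Return nonzero if every bit of flag is set in msgflg.  For a
 * single-bit flag such as IPC_CREAT this is the same as testing
 * whether (msgflg & flag) is not zero. */
int flag_is_set(int msgflg, int flag)
{
    return (msgflg & flag) == flag;
}
```

The permission bits play the part of the "immaterial" positions: they do not affect the outcome of the test.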

IPC_EXCL is another control command used in conjunction with
IPC_CREAT to make the system call fail if, and only if, a msqid already
exists for the specified key. This is necessary to prevent the process
from thinking that it has received a new (unique) msqid when it has not. In
other words, when both IPC_CREAT and IPC_EXCL are specified, a new
msqid is returned if the system call is successful.

Refer to the msgget(2) page in the Programmer's Reference Manual for
specific, associated data structure initialization for successful completion. The
specific failure conditions with error names are contained there also.
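
The effect of IPC_EXCL can be observed by asking for the same key twice. The following sketch assumes that a second exclusive request for a key already in use fails with the error EEXIST, as documented for msgget(2); the function name and the choice of key are hypothetical.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Create a queue for key with IPC_CREAT | IPC_EXCL, then make the
 * same request again.  Returns 1 if the second call failed with
 * EEXIST, 0 if it did not, and -1 if the first call could not create
 * a queue (for example, because the key was already in use). */
int exclusive_create_demo(key_t key)
{
    int first, second, saw_eexist;

    first = msgget(key, IPC_CREAT | IPC_EXCL | 0600);
    if (first == -1)
        return -1;

    second = msgget(key, IPC_CREAT | IPC_EXCL | 0600);
    saw_eexist = (second == -1 && errno == EEXIST);

    msgctl(first, IPC_RMID, (struct msqid_ds *) 0);   /* clean up */
    return saw_eexist;
}
```

Without IPC_EXCL, the second call would instead return the msqid already associated with the key.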

Example Program 

The example program in this section (Figure 9-4) is a menu-driven pro- 
gram which allows all possible combinations of using the msgget(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 4-8) by including the required header files as
specified by the msgget(2) entry in the Programmer's Reference Manual. Note
that the errno.h header file is included as opposed to declaring errno as an
external variable; either method will work.

Variable names have been chosen to be as close as possible to those in the 
synopsis for the system call. Their declarations are self-explanatory. These 
names make the program more readable, and are perfectly legal since they are 
local to the program. Variables declared for this program and their purposes 
are as follows: 

■ key - is used to pass the value for the desired key.

■ opperm - is used to store the desired operation permissions.

■ flags - is used to store the desired control commands (flags).

■ opperm_flags - is used to store the combination from the logical
ORing of the opperm and flags variables; it is then used in the system
call to pass the msgflg argument.

■ msqid - is used for returning the message queue identification number
for a successful system call or the error code (-1) for an unsuccessful
one.

The program begins by prompting for a hexadecimal key, an octal opera-
tion permissions code, and the control command combinations (flags) which
are selected from a menu (lines 15-32). All possible combinations are allowed
even though they might not be viable. This allows observing the errors for
illegal combinations.

Next, the menu selection for the flags is combined with the operation per-
missions, and the result is stored at the address of the opperm_flags variable
(lines 36-51).

The system call is made next, and the result is stored at the address of the 
msqid variable (line 53). 

Since the msqid variable now contains a valid message queue identifier or 
the error code (-1), it is tested to see if an error occurred (line 55). If msqid 
equals -1, a message indicates that an error resulted, and the external errno 
variable is displayed (lines 57 and 58). 

If no error occurred, the returned message queue identifier is displayed 
(line 62). 

The example program for the msgget(2) system call follows. It is sug- 
gested that the source program file be named msgget.c and that the executable 
file be named msgget. 

When compiling C programs that use floating point operations, the -f
option should be used on the cc command line. If this option is not used, the
program will compile successfully, but when the program is executed, it will
fail.

 1  /*This is a program to illustrate
 2  **the message get, msgget(),
 3  **system call capabilities.*/

 4  #include <stdio.h>
 5  #include <sys/types.h>
 6  #include <sys/ipc.h>
 7  #include <sys/msg.h>
 8  #include <errno.h>

 9  /*Start of main C language program*/
10  main()
11  {
12      key_t key;    /*declare as long integer*/
13      int opperm, flags;
14      int msqid, opperm_flags;

15      /*Enter the desired key*/
16      printf("Enter the desired key in hex = ");
17      scanf("%x", &key);

18      /*Enter the desired octal operation
19        permissions.*/
20      printf("\nEnter the operation\n");
21      printf("permissions in octal = ");
22      scanf("%o", &opperm);

23      /*Set the desired flags.*/
24      printf("\nEnter corresponding number to\n");
25      printf("set the desired flags:\n");
26      printf("No flags = 0\n");
27      printf("IPC_CREAT = 1\n");
28      printf("IPC_EXCL = 2\n");
29      printf("IPC_CREAT and IPC_EXCL = 3\n");
30      printf(" Flags = ");

31      /*Get the flag(s) to be set.*/
32      scanf("%d", &flags);

33      /*Check the values.*/
34      printf("\nkey =0x%x, opperm = 0%o, flags = 0%o\n",
35          key, opperm, flags);

36      /*Incorporate the control fields (flags) with
37        the operation permissions*/
38      switch (flags)
39      {
40      case 0:    /*No flags are to be set.*/
41          opperm_flags = (opperm | 0);
42          break;
43      case 1:    /*Set the IPC_CREAT flag.*/
44          opperm_flags = (opperm | IPC_CREAT);
45          break;
46      case 2:    /*Set the IPC_EXCL flag.*/
47          opperm_flags = (opperm | IPC_EXCL);
48          break;
49      case 3:    /*Set the IPC_CREAT and IPC_EXCL flags.*/
50          opperm_flags = (opperm | IPC_CREAT | IPC_EXCL);
51      }

52      /*Call the msgget system call.*/
53      msqid = msgget (key, opperm_flags);

54      /*Perform the following if the call is unsuccessful.*/
55      if (msqid == -1)
56      {
57          printf("\nThe msgget system call failed!\n");
58          printf("The error number = %d\n", errno);
59      }
60      /*Return the msqid upon successful completion.*/
61      else
62          printf("\nThe msqid = %d\n", msqid);
63      exit(0);
64  }

Figure 9-4: msgget() System Call Example



Controlling Message Queues 

This section gives a detailed description of using the msgctl system call. It 
also provides an example program which allows all of its capabilities to be 
exercised. 

Using msgctl 

The synopsis found in the msgctl(2) entry in the Programmer's Reference
Manual is as follows:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int msgctl (msqid, cmd, buf)
int msqid, cmd;
struct msqid_ds *buf;

The msgctl() system call requires three arguments to be passed to it, and it
returns an integer value. Upon successful completion, a zero value is
returned. When unsuccessful, a -1 is returned.

The msqid variable must be a valid, non-negative, integer value. In other 
words, it must have already been created by using the msgget() system call. 

The cmd argument can be replaced by one of the following control com- 
mands (flags): 

IPC_STAT returns the status information contained in the associated data 
structure for the specified msqid, and places it in the data 
structure pointed to by the *buf pointer in the user memory 
area. 

IPC—SET for the specified msqid, sets the effective user and group iden- 
tification, operation permissions, and the number of bytes for 
the message queue. 

IPC—RMID removes the specified msqid along with its associated mes- 
sage queue and data structure. 



A process must have an effective user identification of
OWNER/CREATOR or super-user to perform an IPC_SET or IPC_RMID control
command. Read permission is required to perform the IPC_STAT control
command.

The details of this system call are discussed in the example program for it.
If you have problems understanding the logic manipulations in this program,
read the "Using msgget" section of this chapter; it goes into more detail than
would be practical to do for every system call.
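
As a rough illustration of the control commands described above, the following sketch wraps IPC_STAT and IPC_RMID in two small helpers and exercises them on a throwaway queue. It uses ANSI C prototypes rather than the older style of the figures. The names queue_status, queue_remove, and status_demo are hypothetical, and creating the queue with IPC_PRIVATE and mode 0600 is an assumption made only to keep the example self-contained.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Copy the system's msqid_ds for msqid into *out (IPC_STAT).
 * Returns 0 on success, -1 on failure with errno set by msgctl(). */
int queue_status(int msqid, struct msqid_ds *out)
{
    return msgctl(msqid, IPC_STAT, out);
}

/* Remove the queue (IPC_RMID); buf is not used, so NULL is passed. */
int queue_remove(int msqid)
{
    return msgctl(msqid, IPC_RMID, NULL);
}

/* Hypothetical walk-through: create a private queue, print the
 * members that IPC_SET can change, then remove the queue again. */
int status_demo(void)
{
    struct msqid_ds ds;
    int msqid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);

    if (msqid == -1) {
        perror("msgget");
        return -1;
    }
    if (queue_status(msqid, &ds) == -1) {
        perror("msgctl(IPC_STAT)");
        queue_remove(msqid);
        return -1;
    }
    printf("uid=%ld gid=%ld mode=0%o qbytes=%lu\n",
           (long)ds.msg_perm.uid, (long)ds.msg_perm.gid,
           (unsigned)(ds.msg_perm.mode & 0777),
           (unsigned long)ds.msg_qbytes);
    return queue_remove(msqid);
}
```

Calling status_demo() from a main() function should print one status line and leave no queue behind. Once the queue has been removed, a further IPC_STAT on the same msqid is expected to fail with -1.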

Example Program 

The example program in this section (Figure 9-5) is a menu-driven pro- 
gram which allows all possible combinations of using the msgctl(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 5-9) by including the required header files as
specified by the msgctl(2) entry in the Programmer's Reference Manual. Note
in this program that errno is declared as an external variable, and therefore,
the errno.h header file does not have to be included.

Variable and structure names have been chosen to be as close as possible 
to those in the synopsis for the system call. Their declarations are self- 
explanatory. These names make the program more readable, and are perfectly 
legal since they are local to the program. Variables declared for this program 
and their purpose are as follows: 

■ uid — used to store the IPC_SET value for the effective user identifica- 
tion 

■ gid — used to store the IPC_SET value for the effective group identifi- 
cation 

■ mode — used to store the IPC_SET value for the operation permissions 

■ bytes — used to store the IPC_SET value for the number of bytes in
the message queue (msg_qbytes)

■ rtrn — used to store the return integer value from the system call

■ msqid — used to store and pass the message queue identifier to the 
system call 

■ command — used to store the code for the desired control command so 
that subsequent processing can be performed on it 




■ choice — used to determine which member is to be changed for the 
IPC_SET control command 

■ msqid_ds — used to receive the specified message queue identifier's
data structure when an IPC_STAT control command is performed

■ *buf — a pointer passed to the system call which locates the data
structure in the user memory area where the IPC_STAT control command is
to place its return values or where the IPC_SET command gets the
values to set

Note that the msqid_ds data structure in this program (line 16) uses the 
data structure located in the msg.h header file of the same name as a template 
for its declaration. This is a perfect example of the advantage of local vari- 
ables. 

The next important thing to observe is that, although the *buf pointer is 
declared to be a pointer to a data structure of the msqid_ds type, it must also 
be initialized to contain the address of the user memory area data structure 
(line 17). Now that all of the required declarations have been explained for 
this program, this is how it works. 

First, the program prompts for a valid message queue identifier which is 
stored at the address of the msqid variable (lines 19 and 20). This is required 
for every msgctl system call. 

Then the code for the desired control command must be entered (lines 
21-27), and it is stored at the address of the command variable. The code is 
tested to determine the control command for subsequent processing. 

If the IPC_STAT control command is selected (code 1), the system call is 
performed (lines 37 and 38) and the status information returned is printed out 
(lines 39-46); only the members that can be set are printed out in this pro- 
gram. Note that if the system call is unsuccessful (line 106), the status infor- 
mation of the last successful call is printed out. In addition, an error message 
is displayed and the errno variable is printed out (lines 108 and 109). If the 
system call is successful, a message indicates this, along with the message 
queue identifier used (lines 111-114). 

If the IPC_SET control command is selected (code 2), the first thing done
is to get the current status information for the message queue identifier
specified (lines 50-52). This is necessary because this example program provides
for changing only one member at a time, and the system call changes all of
them. Also, if an invalid value happened to be stored in the user memory
area for one of these members, it would cause repetitive failures for this
control command until corrected. The next thing the program does is to prompt
for a code corresponding to the member to be changed (lines 53-59). This
code is stored at the address of the choice variable (line 60). Now, depending
upon the member picked, the program prompts for the new value (lines 66-95).
The value is placed at the address of the appropriate member in the user
memory area data structure, and the system call is made (lines 96-98).
Depending upon success or failure, the program returns the same messages as
for IPC_STAT above.
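
The read-then-write sequence described above can be condensed into a short sketch. It is not part of the example program: the function name queue_set_mode is hypothetical, ANSI C prototypes are used, and only the permission bits are changed, on the assumption that the other members should keep whatever values the system currently holds.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Change only the operation permissions of a queue, leaving the
 * uid, gid, and msg_qbytes members as the system currently has them.
 * Returns 0 on success, -1 on failure with errno set by msgctl(). */
int queue_set_mode(int msqid, unsigned int mode)
{
    struct msqid_ds ds;

    /* Step 1: fetch the current values so that IPC_SET writes the
     * unchanged members back exactly as they were. */
    if (msgctl(msqid, IPC_STAT, &ds) == -1)
        return -1;

    /* Step 2: overwrite the single member to be changed. */
    ds.msg_perm.mode = mode & 0777;

    /* Step 3: hand the whole structure back to the system. */
    return msgctl(msqid, IPC_SET, &ds);
}
```

Skipping step 1 would pass whatever happened to be in the local structure for the other members, which is the repetitive-failure case the text warns about.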

If the IPC_RMID control command (code 3) is selected, the system call is
performed (lines 100-103), and the msqid along with its associated message 
queue and data structure are removed from the UNIX Operating System. 
Note that the *buf pointer is not required as an argument to perform this con- 
trol command, and its value can be zero or NULL. Depending upon the suc- 
cess or failure, the program returns the same messages as for the other control 
commands. 

The example program for the msgctl() system call follows. It is suggested
that the source program file be named msgctl.c and that the executable file be
named msgctl.

When compiling C programs that use floating point operations, the -f
option should be used on the cc command line. If this option is not used, the
program will compile successfully, but when the program is executed, it will
fail.

1     /*This is a program to illustrate
2     **the message control, msgctl(),
3     **system call capabilities.
4     */
5     /*Include necessary header files.*/
6     #include    <stdio.h>
7     #include    <sys/types.h>
8     #include    <sys/ipc.h>
9     #include    <sys/msg.h>
10    /*Start of main C language program*/
11    main()
12    {
13        extern int errno;
14        int uid, gid, mode, bytes;
15        int rtrn, msqid, command, choice;
16        struct msqid_ds msqid_ds, *buf;
17        buf = &msqid_ds;
18        /*Get the msqid, and command.*/
19        printf("Enter the msqid = ");
20        scanf("%d", &msqid);
21        printf("\nEnter the number for\n");
22        printf("the desired command:\n");
23        printf("IPC_STAT   =  1\n");
24        printf("IPC_SET    =  2\n");
25        printf("IPC_RMID   =  3\n");
26        printf("Entry      =  ");
27        scanf("%d", &command);

Figure 9-5: msgctl() System Call Example (Sheet 1 of 4)

28        /*Check the values.*/
29        printf("\nmsqid =%d, command = %d\n",
30            msqid, command);
31        switch (command)
32        {
33        case 1:   /*Use msgctl() to duplicate
34                    the data structure for
35                    msqid in the msqid_ds area pointed
36                    to by buf and then print it out.*/
37            rtrn = msgctl(msqid, IPC_STAT,
38                buf);
39            printf("\nThe USER ID = %d\n",
40                buf->msg_perm.uid);
41            printf("The GROUP ID = %d\n",
42                buf->msg_perm.gid);
43            printf("The operation permissions = 0%o\n",
44                buf->msg_perm.mode);
45            printf("The msg_qbytes = %d\n",
46                buf->msg_qbytes);
47            break;
48        case 2:   /*Select and change the desired
49                    member(s) of the data structure.*/
50            /*Get the original data for this msqid
51              data structure first.*/
52            rtrn = msgctl(msqid, IPC_STAT, buf);
53            printf("\nEnter the number for the\n");
54            printf("member to be changed:\n");
55            printf("msg_perm.uid   = 1\n");
56            printf("msg_perm.gid   = 2\n");
57            printf("msg_perm.mode  = 3\n");
58            printf("msg_qbytes     = 4\n");
59            printf("Entry          = ");

Figure 9-5: msgctl() System Call Example (Sheet 2 of 4)

60            scanf("%d", &choice);
61            /*Only one choice is allowed per
62              pass as an illegal entry will
63              cause repetitive failures until
64              msqid_ds is updated with
65              IPC_STAT.*/
66            switch(choice){
67            case 1:
68                printf("\nEnter USER ID = ");
69                scanf("%d", &uid);
70                buf->msg_perm.uid = uid;
71                printf("\nUSER ID = %d\n",
72                    buf->msg_perm.uid);
73                break;
74            case 2:
75                printf("\nEnter GROUP ID = ");
76                scanf("%d", &gid);
77                buf->msg_perm.gid = gid;
78                printf("\nGROUP ID = %d\n",
79                    buf->msg_perm.gid);
80                break;
81            case 3:
82                printf("\nEnter MODE = ");
83                scanf("%o", &mode);
84                buf->msg_perm.mode = mode;
85                printf("\nMODE = 0%o\n",
86                    buf->msg_perm.mode);
87                break;

Figure 9-5: msgctl() System Call Example (Sheet 3 of 4)

88            case 4:
89                printf("\nEnter msg_qbytes = ");
90                scanf("%d", &bytes);
91                buf->msg_qbytes = bytes;
92                printf("\nmsg_qbytes = %d\n",
93                    buf->msg_qbytes);
94                break;
95            }
96            /*Do the change.*/
97            rtrn = msgctl(msqid, IPC_SET,
98                buf);
99            break;
100       case 3:   /*Remove the msqid along with its
101                   associated message queue
102                   and data structure.*/
103           rtrn = msgctl(msqid, IPC_RMID, NULL);
104       }
105       /*Perform the following if the call is unsuccessful.*/
106       if(rtrn == -1)
107       {
108           printf("\nThe msgctl system call failed!\n");
109           printf("The error number = %d\n", errno);
110       }
111       /*Return the msqid upon successful completion.*/
112       else
113           printf("\nMsgctl was successful for msqid = %d\n",
114               msqid);
115       exit(0);
116   }

Figure 9-5: msgctl() System Call Example (Sheet 4 of 4)

Operations for Messages

This section gives a detailed description of using the msgsnd(2) and 
msgrcv(2) system calls, along with an example program which allows all of 
their capabilities to be exercised. 

Using msgop 

The synopsis found in the msgop(2) entry in the Programmer's Reference 
Manual is as follows: 



#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

int msgsnd (msqid, msgp, msgsz, msgflg)
int msqid;
struct msgbuf *msgp;
int msgsz, msgflg;

int msgrcv (msqid, msgp, msgsz, msgtyp, msgflg)
int msqid;
struct msgbuf *msgp;
int msgsz;
long msgtyp;
int msgflg;



Sending a Message 

The msgsnd system call requires four arguments to be passed to it, and it 
returns an integer value. Upon successful completion, a zero value is 
returned. When unsuccessful, a -1 is returned. 

The msqid argument must be a valid, non-negative, integer value. In
other words, it must have already been created by using the msgget() system
call.

The msgp argument is a pointer to a structure in the user memory area
that contains the type of the message and the message to be sent.

The msgsz argument specifies the length of the character array in the data 
structure pointed to by the msgp argument. This is the length of the message. 
The maximum size of this array is determined by the MSGMAX system tun- 
able parameter. 

The msg_qbytes data structure member can be lowered from MSGMNB
by using the msgctl() IPC_SET control command, but only the super-user can
raise it afterwards.

The msgflg argument allows the "blocking message operation" to be
performed if the IPC_NOWAIT flag is not set (msgflg & IPC_NOWAIT = 0);
this would occur if the total number of bytes allowed on the specified message
queue is in use (msg_qbytes or MSGMNB), or the total system-wide
number of messages on all queues is equal to the system imposed limit
(MSGTQL). If the IPC_NOWAIT flag is set, the system call will fail in these
circumstances and return a -1.

Further details of this system call are discussed in the example program 
for it. If you have problems understanding the logic manipulations in this 
program, read the "Using msgget" section of this chapter; it goes into more 
detail than would be practical to do for every system call. 
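
A minimal sketch of a nonblocking send, under stated assumptions, might look like the following. The structure text_msg, the limit TEXT_MAX, and the function try_send are hypothetical names chosen for illustration; the errno value EAGAIN for a full queue is the one documented for msgsnd() when IPC_NOWAIT is set.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

#define TEXT_MAX 128            /* hypothetical limit for this sketch */

struct text_msg {               /* same shape as msgbuf, larger mtext */
    long mtype;                 /* message type, must be greater than 0 */
    char mtext[TEXT_MAX];       /* message text */
};

/* Try to queue text without blocking.
 * Returns 0 if sent, 1 if the queue was full (EAGAIN),
 * and -1 on any other error. */
int try_send(int msqid, long type, const char *text)
{
    struct text_msg m;
    size_t len = strlen(text);

    if (type <= 0 || len > TEXT_MAX) {
        errno = EINVAL;
        return -1;
    }
    m.mtype = type;
    memcpy(m.mtext, text, len);

    /* msgsz counts only the mtext bytes, not the mtype member. */
    if (msgsnd(msqid, &m, len, IPC_NOWAIT) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}
```

A caller that gets 1 back can retry later or fall back to a blocking send by passing zero for msgflg instead of IPC_NOWAIT.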

Receiving Messages

The msgrcv() system call requires five arguments to be passed to it, and it
returns an integer value. Upon successful completion, a value equal to the
number of bytes received is returned. When unsuccessful, a -1 is returned.

The msqid argument must be a valid, non-negative, integer value; that is, 
it must have already been created by using the msgget() system call. 

The msgp argument is a pointer to a structure in the user memory area 
that will receive the message type and the message text. 

The msgsz argument specifies the length of the message to be received. If
its value is less than the length of the message on the queue, an error can be
returned if desired; see the msgflg argument.

The msgtyp argument is used to pick the first message on the message
queue of the particular type specified. If it is equal to zero, the first message
on the queue is received; if it is greater than zero, the first message of the
same type is received; if it is less than zero, the first message of the lowest
type that is less than or equal to its absolute value is received.

The msgflg argument allows the "blocking message operation" to be
performed if the IPC_NOWAIT flag is not set (msgflg & IPC_NOWAIT = 0);
this would occur if there is not a message on the message queue of the desired
type (msgtyp) to be received. If the IPC_NOWAIT flag is set, the system call
will fail immediately when there is not a message of the desired type on the
queue. msgflg can also specify that the system call fail if the message is
longer than the size to be received; this is done by not setting the
MSG_NOERROR flag in the msgflg argument (msgflg & MSG_NOERROR =
0). If the MSG_NOERROR flag is set, the message is truncated to the length
specified by the msgsz argument of msgrcv().
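
The three cases of the msgtyp rule (zero, positive, negative) can be modelled without touching a real queue. The sketch below is only a model: pick_message is a hypothetical function that takes the types of the queued messages in arrival order and returns the position of the message that the rule would select.

```c
#include <stddef.h>

/* Model of the msgtyp rule. types[] holds the message types in the
 * order the messages were queued. Returns the index of the message
 * that would be received, or -1 if no message qualifies. */
int pick_message(const long *types, size_t n, long msgtyp)
{
    size_t i;
    int best = -1;

    if (msgtyp == 0)                    /* first message of any type */
        return n > 0 ? 0 : -1;

    if (msgtyp > 0) {                   /* first message of that type */
        for (i = 0; i < n; i++)
            if (types[i] == msgtyp)
                return (int)i;
        return -1;
    }

    /* msgtyp < 0: lowest type that is <= |msgtyp|; among messages
     * of that lowest type, the one queued first is chosen. */
    for (i = 0; i < n; i++)
        if (types[i] <= -msgtyp &&
            (best == -1 || types[i] < types[best]))
            best = (int)i;
    return best;
}
```

For a queue holding types 5, 3, 7, 3 in that order, a msgtyp of 3 selects the second message, and a msgtyp of -4 selects the same one, since 3 is the lowest type not greater than 4.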

Further details of this system call are discussed in the example program 
for it. If you have problems understanding the logic manipulations in this 
program, read the "Using msgget" section of this chapter; it goes into more 
detail than would be practical to do for every system call. 
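
The effect of the two receive flags can be tried with a short sketch such as the one below. The names small_msg and recv_limited are hypothetical, and the 64-byte text array is an arbitrary choice for the illustration. The function always sets IPC_NOWAIT so that it never suspends, and sets MSG_NOERROR only on request.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct small_msg {
    long mtype;                 /* message type */
    char mtext[64];             /* message text */
};

/* Receive at most want bytes of the first message of the given type,
 * without blocking. If truncate is nonzero, a longer message is cut
 * to want bytes (MSG_NOERROR); otherwise msgrcv() fails and the
 * message stays on the queue.
 * Returns the number of bytes copied to out, or -1 with errno set. */
long recv_limited(int msqid, long type, char *out, size_t want,
                  int truncate)
{
    struct small_msg m;
    int flags = IPC_NOWAIT;
    ssize_t n;

    if (want > sizeof m.mtext)
        want = sizeof m.mtext;
    if (truncate)
        flags |= MSG_NOERROR;

    n = msgrcv(msqid, &m, want, type, flags);
    if (n >= 0)
        memcpy(out, m.mtext, (size_t)n);
    return (long)n;
}
```

With a ten-byte message on the queue, asking for four bytes without truncation is expected to fail with errno set to E2BIG, while the same request with truncation returns 4 and removes the message.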

Example Program 

The example program in this section (Figure 9-6) is a menu-driven pro- 
gram which allows all possible combinations of using the msgsnd() and 
msgrcv(2) system calls to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 5-9) by including the required header files as
specified by the msgop(2) entry in the Programmer's Reference Manual. Note
that in this program errno is declared as an external variable, and therefore,
the errno.h header file does not have to be included.

Variable and structure names have been chosen to be as close as possible 
to those in the synopsis for the system call. Their declarations are self- 
explanatory. These names make the program more readable, and are perfectly 
legal since they are local to the program. Variables declared for this program 
and their purposes are: 

■ sndbuf — is used as a buffer to contain a message to be sent (line 13);
it uses the msgbuf1 data structure as a template (lines 10-13). The
msgbuf1 structure (lines 10-13) is almost an exact duplicate of the
msgbuf structure contained in the msg.h header file. The only
difference is that the character array for msgbuf1 contains the maximum
message size (MSGMAX) for your computer, where in msgbuf it is set
to one (1) to satisfy the compiler. For this reason msgbuf cannot be
used directly as a template for the user-written program. It is there so
you can determine its members.

■ rcvbuf — is used as a buffer to receive a message (line 13); it uses the
msgbuf1 data structure as a template (lines 10-13).

■ *msgp — is used as a pointer (line 13) to both the sndbuf and rcvbuf
buffers.

■ i — is used as a counter to input characters from the keyboard, to store
them in the array, and to keep track of the message length for the
msgsnd() system call; it is also used as a counter to output the received
message for the msgrcv() system call.

■ c — is used to receive the input character from the getchar() function
(line 50).

■ flag — is used to store the code of IPC_NOWAIT for the msgsnd()
system call (line 61).

■ flags — is used to store the code of the IPC_NOWAIT or
MSG_NOERROR flags for the msgrcv() system call (line 117).

■ choice — is used to store the code for sending or receiving (line 30). 

■ rtrn — is used to store the return values from all system calls. 

■ msqid — is used to store and pass the desired message queue identifier 
for both system calls. 

■ msgsz — is used to store and pass the size of the message to be sent or 
received. 

■ msgflg — is used to pass the value of flag for sending or the value of 
flags for receiving. 

■ msgtyp — is used for specifying the message type for sending, or used
to pick a message type for receiving.

Note that a msqid_ds data structure is set up in the program (line 21)
with a pointer which is initialized to point to it (line 22); this will allow the
data structure members that are affected by message operations to be
observed. They are observed by using the msgctl() (IPC_STAT) system call
to get them for the program to print them out (lines 80-92 and lines 161-168).

The first thing the program prompts for is whether to send or receive a
message. A corresponding code must be entered for the desired operation,
and it is stored at the address of the choice variable (lines 23-30). Depending
upon the code, the program proceeds as in the following msgsnd or msgrcv
sections.

msgsnd 

When the code is to send a message, the msgp pointer is initialized (line
33) to the address of the send data structure, sndbuf. The message queue
identifier is then requested and stored at the address of msqid (lines 34-37).
Next, a message type must be entered for the message; it is stored at the
address of the variable msgtyp (line 42), and then (line 43) it is put into the
mtype member of the data structure pointed to by msgp.

The program now prompts for a message to be entered from the keyboard
and enters a loop of getting and storing into the mtext array of the data
structure (lines 48-51). This will continue until an end of file is recognized, which
for the getchar() function is a control-d (CTRL-D) immediately following a
carriage return (<CR>). When this happens, the size of the message is
determined by adding one to the i counter (lines 52 and 53), as it stored the
message beginning in the zero array element of mtext. Keep in mind that the
message also contains the terminating characters, and the message will
therefore appear to be three characters short of msgsz.
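
The input loop described above stores characters until end of file without checking the size of the mtext array. A bounded variant, offered only as a sketch, is shown below. The function read_text and the constant MTEXT_MAX are hypothetical; the function reads from any stream rather than only the keyboard, and it returns the number of characters actually stored, which is one value a caller could pass as msgsz.

```c
#include <stdio.h>

#define MTEXT_MAX 8192      /* mirrors the mtext array size in the example */

/* Read characters from in until end of file or until cap characters
 * have been stored. Returns the number of characters stored in text. */
size_t read_text(FILE *in, char *text, size_t cap)
{
    size_t i = 0;
    int c;

    /* The bound is tested first, so no character is read from the
     * stream once the array is full. */
    while (i < cap && (c = getc(in)) != EOF)
        text[i++] = (char)c;
    return i;
}
```

A call such as read_text(stdin, sndbuf.mtext, MTEXT_MAX) would stand in for the loop at lines 50 and 51, assuming sndbuf is declared as in the example program.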

The message is immediately echoed from the mtext array of the sndbuf 
data structure to provide feedback (lines 54-56). 

The next and final thing that must be decided is whether to set the
IPC_NOWAIT flag. The program does this by requesting that a code of a 1
be entered for yes or anything else for no (lines 57-65). It is stored at the
address of the flag variable. If a 1 is entered, IPC_NOWAIT is logically ORed
with msgflg; otherwise, msgflg is set to zero.

The msgsnd() system call is performed (line 69). If it is unsuccessful, a
failure message is displayed along with the error number (lines 70-72). If it is
successful, the returned value is printed, which should be zero (lines 73-76).

Every time a message is successfully sent, there are three members of the 
associated data structure which are updated. They are described as follows: 

msg_qnum represents the total number of messages on the message
queue; it is incremented by one.

msg_lspid contains the Process Identification (PID) number of the last
process sending a message; it is set accordingly.

msg_stime contains the time in seconds since January 1, 1970,
Greenwich Mean Time (GMT) of the last message sent; it is
set accordingly.

These members are displayed after every successful message send opera- 
tion (lines 79-92). 
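
The update of these members can be observed with a sketch along the following lines. The names note and send_and_count are hypothetical, and the four-byte message text is arbitrary. The function takes a status snapshot before and after one send and reports the difference in msg_qnum.

```c
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct note {
    long mtype;                 /* message type */
    char mtext[16];             /* message text */
};

/* Send one short message and report how msg_qnum changed.
 * Returns the increase in msg_qnum (1 is expected), or -1 on error.
 * If last_sender is not NULL, it receives msg_lspid after the send. */
long send_and_count(int msqid, pid_t *last_sender)
{
    struct msqid_ds before, after;
    struct note n = { 1, "ping" };

    if (msgctl(msqid, IPC_STAT, &before) == -1)
        return -1;
    if (msgsnd(msqid, &n, 4, IPC_NOWAIT) == -1)
        return -1;
    if (msgctl(msqid, IPC_STAT, &after) == -1)
        return -1;
    if (last_sender)
        *last_sender = after.msg_lspid;
    return (long)after.msg_qnum - (long)before.msg_qnum;
}
```

If no other process is using the queue at the same moment, the function should return 1 and report the calling process's own PID as the last sender.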

msgrcv 

If the code specifies that a message is to be received, the program contin- 
ues execution as in the following paragraphs. 

The msgp pointer is initialized to the rcvbuf data structure (line 99). 

Next, the message queue identifier of the message queue from which you 
will receive the message is requested, and it is stored at the address of msqid 
(lines 100-103). 

The message type is requested, and it is stored at the address of msgtyp 
(lines 104-107). 

The code for the desired combination of control flags is requested next, 
and it is stored at the address of flags (lines 108-117). Depending upon the 
selected combination, msgflg is set accordingly (lines 118-133). 
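
The mapping from the entered code to a msgflg value can be written as a small function. This is a sketch rather than the program's own code: flags_from_code is a hypothetical name, and it builds each value from zero instead of ORing into an existing variable, so the result does not depend on what msgflg held beforehand.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Translate the menu code (0 through 3) into a msgflg value.
 * Returns -1 for a code that is not on the menu. */
int flags_from_code(int code)
{
    switch (code) {
    case 0:
        return 0;                           /* no flags */
    case 1:
        return MSG_NOERROR;                 /* truncate long messages */
    case 2:
        return IPC_NOWAIT;                  /* do not suspend */
    case 3:
        return MSG_NOERROR | IPC_NOWAIT;    /* both */
    default:
        return -1;
    }
}
```

A caller would check for -1 before passing the value to msgrcv(), for example by prompting again for the code.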

Finally, the number of bytes to be received is requested, and it is stored at 
the address of msgsz (lines 134-137). 

The msgrcv() system call is performed (line 144). If it is unsuccessful, a
message and error number are displayed (lines 145-148). If successful, a
message indicates so, and the number of bytes returned is displayed followed by
the received message (lines 153-159).

When a message is successfully received, there are three members of the 
associated data structure which are updated; they are described as follows: 

msg_qnum contains the number of messages on the message queue; it is
decremented by one.

msg_lrpid contains the process identification (PID) of the last process
receiving a message; it is set accordingly.

msg_rtime contains the time in seconds since January 1, 1970,
Greenwich Mean Time (GMT) that the last process received
a message; it is set accordingly.

The example program for the msgop() system calls follows. It is
suggested that the program be put into a source file called msgop.c and then into
an executable file called msgop.

When compiling C programs that use floating point operations, the -f
option should be used on the cc command line. If this option is not used, the
program will compile successfully, but when the program is executed it will
fail. The -f option is not required, however, for your computer.






1     /*This is a program to illustrate
2     **the message operations, msgop(),
3     **system call capabilities.
4     */
5     /*Include necessary header files.*/
6     #include    <stdio.h>
7     #include    <sys/types.h>
8     #include    <sys/ipc.h>
9     #include    <sys/msg.h>
10    struct msgbuf1 {
11        long    mtype;
12        char    mtext[8192];
13    } sndbuf, rcvbuf, *msgp;
14    /*Start of main C language program*/
15    main()
16    {
17        extern int errno;
18        int i, c, flag, flags, choice;
19        int rtrn, msqid, msgsz, msgflg = 0;
20        long mtype, msgtyp;
21        struct msqid_ds msqid_ds, *buf;
22        buf = &msqid_ds;



Figure 9-6: msgop() System Call Example (Sheet 1 of 7)

23        /*Select the desired operation.*/
24        printf("Enter the corresponding\n");
25        printf("code to send or\n");
26        printf("receive a message:\n");
27        printf("Send           =  1\n");
28        printf("Receive        =  2\n");
29        printf("Entry          =  ");
30        scanf("%d", &choice);
31        if(choice == 1)   /*Send a message.*/
32        {
33            msgp = &sndbuf;   /*Point to user send structure.*/
34            printf("\nEnter the msqid of\n");
35            printf("the message queue to\n");
36            printf("handle the message = ");
37            scanf("%d", &msqid);
38            /*Set the message type.*/
39            printf("\nEnter a positive integer\n");
40            printf("message type (long) for the\n");
41            printf("message = ");
42            scanf("%ld", &msgtyp);
43            msgp->mtype = msgtyp;
44            /*Enter the message to send.*/
45            printf("\nEnter a message: \n");
46            /*A control-d (^d) terminates as
47              EOF.*/

Figure 9-6: msgop() System Call Example (Sheet 2 of 7)

48            /*Get each character of the message
49              and put it in the mtext array.*/
50            for(i = 0; ((c = getchar()) != EOF); i++)
51                sndbuf.mtext[i] = c;
52            /*Determine the message size.*/
53            msgsz = i + 1;
54            /*Echo the message to send.*/
55            for(i = 0; i < msgsz; i++)
56                putchar(sndbuf.mtext[i]);
57            /*Set the IPC_NOWAIT flag if
58              desired.*/
59            printf("\nEnter a 1 if you want the\n");
60            printf("IPC_NOWAIT flag set: ");
61            scanf("%d", &flag);
62            if(flag == 1)
63                msgflg |= IPC_NOWAIT;
64            else
65                msgflg = 0;
66            /*Check the msgflg.*/
67            printf("\nmsgflg = 0%o\n", msgflg);
68            /*Send the message.*/
69            rtrn = msgsnd(msqid, msgp, msgsz, msgflg);
70            if(rtrn == -1)
71                printf("\nMsgsnd failed. Error = %d\n",
72                    errno);
73            else {
74                /*Print the value of test which
75                  should be zero for successful.*/
76                printf("\nValue returned = %d\n", rtrn);



Figure 9-6: msgop() System Call Example (Sheet 3 of 7) 




77                /*Print the size of the message
78                  sent.*/
79                printf("\nMsgsz = %d\n", msgsz);
80                /*Check the data structure update.*/
81                msgctl(msqid, IPC_STAT, buf);
82                /*Print out the affected members.*/
83                /*Print the incremented number of
84                  messages on the queue.*/
85                printf("\nThe msg_qnum = %d\n",
86                    buf->msg_qnum);
87                /*Print the process id of the last sender.*/
88                printf("The msg_lspid = %d\n",
89                    buf->msg_lspid);
90                /*Print the last send time.*/
91                printf("The msg_stime = %d\n",
92                    buf->msg_stime);
93            }
94        }
95        if(choice == 2)   /*Receive a message.*/
96        {
97            /*Initialize the message pointer
98              to the receive buffer.*/
99            msgp = &rcvbuf;
100           /*Specify the message queue which contains
101             the desired message.*/
102           printf("\nEnter the msqid = ");
103           scanf("%d", &msqid);

Figure 9-6: msgop() System Call Example (Sheet 4 of 7)

104           /*Specify the specific message on the queue
105             by using its type.*/
106           printf("\nEnter the msgtyp = ");
107           scanf("%ld", &msgtyp);
108           /*Configure the control flags for the
109             desired actions.*/
110           printf("\nEnter the corresponding code\n");
111           printf("to select the desired flags: \n");
112           printf("No flags                   =  0\n");
113           printf("MSG_NOERROR                =  1\n");
114           printf("IPC_NOWAIT                 =  2\n");
115           printf("MSG_NOERROR and IPC_NOWAIT =  3\n");
116           printf("                  Flags    =  ");
117           scanf("%d", &flags);
118           switch(flags) {
119               /*Set msgflg by ORing it with the appropriate
120                 flags (constants).*/
121           case 0:
122               msgflg = 0;
123               break;
124           case 1:
125               msgflg |= MSG_NOERROR;
126               break;
127           case 2:
128               msgflg |= IPC_NOWAIT;
129               break;
130           case 3:
131               msgflg |= MSG_NOERROR | IPC_NOWAIT;
132               break;
133           }






Figure 9-6: msgop() System Call Example (Sheet 5 of 7)

134           /*Specify the number of bytes to receive.*/
135           printf("\nEnter the number of bytes\n");
136           printf("to receive (msgsz) = ");
137           scanf("%d", &msgsz);
138           /*Check the values for the arguments.*/
139           printf("\nmsqid =%d\n", msqid);
140           printf("\nmsgtyp = %ld\n", msgtyp);
141           printf("\nmsgsz = %d\n", msgsz);
142           printf("\nmsgflg = 0%o\n", msgflg);
143           /*Call msgrcv to receive the message.*/
144           rtrn = msgrcv(msqid, msgp, msgsz, msgtyp, msgflg);
145           if(rtrn == -1) {
146               printf("\nMsgrcv failed. ");
147               printf("Error = %d\n", errno);
148           }
149           else {
150               printf("\nMsgrcv was successful\n");
151               printf("for msqid = %d\n",
152                   msqid);
153               /*Print the number of bytes received,
154                 it is equal to the return
155                 value.*/
156               printf("Bytes received = %d\n", rtrn);

Figure 9-6: msgop() System Call Example (Sheet 6 of 7) 




157               /*Print the received message.*/
158               for(i = 0; i < rtrn; i++)
159                   putchar(rcvbuf.mtext[i]);
160           }
161           /*Check the associated data structure.*/
162           msgctl(msqid, IPC_STAT, buf);
163           /*Print the decremented number of messages.*/
164           printf("\nThe msg_qnum = %d\n", buf->msg_qnum);
165           /*Print the process id of the last receiver.*/
166           printf("The msg_lrpid = %d\n", buf->msg_lrpid);
167           /*Print the last message receive time.*/
168           printf("The msg_rtime = %d\n", buf->msg_rtime);
169       }
170   }

Figure 9-6: msgop() System Call Example (Sheet 7 of 7)

The semaphore type of IPC allows processes to communicate through the
exchange of semaphore values. A semaphore is a non-negative integer (0 through
32,767). Since many applications require the use of more than one
semaphore, the UNIX Operating System has the ability to create sets or arrays of
semaphores. A semaphore set can contain one or more semaphores up to a
limit set by the system administrator. The tunable parameter SEMMSL has a
default value of 25. Semaphore sets are created by using the semget(2)
system call.

The process performing the semget(2) system call becomes the 
owner/creator, determines how many semaphores are in the set, and sets the 
operation permissions for the set, including itself. This process can subse- 
quently relinquish ownership of the set or change the operation permissions 
using the semctl(), semaphore control, system call. The creating process 
always remains the creator as long as the facility exists. Other processes with 
permission can use semctl() to perform other control functions. 

Provided a process has alter permission, it can manipulate the 
semaphore(s). Each semaphore within a set can be manipulated in two ways 
with the semop(2) system call (which is documented in the Programmer's 
Reference Manual): 

■ incremented 

■ decremented 

To increment a semaphore, an integer value of the desired magnitude is 
passed to the semop(2) system call. To decrement a semaphore, a minus (-) 
value of the desired magnitude is passed. 

The UNIX Operating System ensures that only one process can manipu- 
late a semaphore set at any given time. Simultaneous requests are performed 
sequentially in an arbitrary manner. 

A process can test for a semaphore value to be greater than a certain value
by attempting to decrement the semaphore by one more than that value. If
the process is successful, then the semaphore value is greater than that certain
value. Otherwise, the semaphore value is not. While doing this, the process
can have its execution suspended (IPC_NOWAIT flag not set) until the
semaphore value would permit the operation (other processes increment the
semaphore), or the semaphore facility is removed.

The ability to suspend execution is called a "blocking semaphore
operation." This ability is also available for a process which is testing for a
semaphore to be or become equal to zero; only read permission is required for
this test, and it is accomplished by passing a value of zero to the semop(2)
system call.

On the other hand, if the process is not successful and the process does
not request to have its execution suspended, it is called a "nonblocking
semaphore operation." In this case, the process is returned a known error code
(-1), and the external errno variable is set accordingly.
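
A nonblocking semaphore operation, and the "greater than" test built on it, can be sketched as follows. The function names sem_try and sem_greater_than are hypothetical. The test restores the semaphore after a successful decrement, and the decrement and the restore are two separate semop() calls, so another process could act on the semaphore in between; that is an accepted limitation of this sketch.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Apply delta to semaphore semnum of set semid without blocking.
 * delta > 0 increments, delta < 0 decrements, delta == 0 tests for zero.
 * Returns 0 on success, 1 if the operation would have blocked (EAGAIN),
 * and -1 on any other error. */
int sem_try(int semid, unsigned short semnum, short delta)
{
    struct sembuf op;

    op.sem_num = semnum;
    op.sem_op  = delta;
    op.sem_flg = IPC_NOWAIT;

    if (semop(semid, &op, 1) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}

/* Is the semaphore value greater than v? Try to take v + 1 and,
 * if that works, give it straight back.
 * Returns 1 for yes, 0 for no, and -1 on error. */
int sem_greater_than(int semid, unsigned short semnum, short v)
{
    int r = sem_try(semid, semnum, (short)-(v + 1));

    if (r == 0 && sem_try(semid, semnum, (short)(v + 1)) != 0)
        return -1;
    return r == 0 ? 1 : (r == 1 ? 0 : -1);
}
```

Removing IPC_NOWAIT from sem_flg would turn sem_try into the blocking form, in which the call suspends instead of returning 1.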

The blocking semaphore operation allows processes to communicate based 
on the values of semaphores at different points in time. Remember also that 
IPC facilities remain in the UNIX Operating System until removed by a per- 
mitted process or until the system is reinitialized. 

Operating on a semaphore set is done by using the semop(2), semaphore 
operation, system call. 

When a set of semaphores is created, the first semaphore in the set is 
semaphore number zero. The last semaphore number in the set is one less 
than the total in the set. 

An array of these "blocking/nonblocking operations" can be performed
on a set containing more than one semaphore. When performing an array of
operations, the "blocking/nonblocking operations" can be applied to any or
all of the semaphores in the set. Also, the operations can be applied in any
order of semaphore number. However, no operations are done until they can
all be done successfully. This requirement means that preceding changes
made to semaphore values in the set must be undone when a "blocking
semaphore operation" on a semaphore in the set cannot be completed suc-
cessfully; no changes are made until they can all be made. For example, if a
process has successfully completed three of six operations on a set of ten
semaphores but is "blocked" from performing the fourth operation, no
changes are made to the set until the fourth and remaining operations are suc-
cessfully performed. Additionally, any operation preceding or succeeding the
"blocked" operation, including the blocked operation, can specify that, at such
time that all operations can be performed successfully, the operation be
undone. Otherwise, the operations are performed and the semaphores are
changed, or one "nonblocking operation" is unsuccessful and none are
changed. All of this is commonly referred to as being "atomically per-
formed."
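
The all-or-nothing behavior of an array of operations can be observed with a short C sketch. It is an illustration only: the function name atomic_demo and the union name semarg are hypothetical, and current semop(2) and semctl(2) prototypes are assumed. Semaphore 0 starts at 1 and semaphore 1 at 0; a nonblocking request to decrement both fails, and semaphore 0 is left unchanged.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Caller-supplied semctl() argument; the union name semarg is hypothetical. */
union semarg {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Hypothetical demonstration of an atomically performed array of
 * operations. Returns 0 if the failed array changed nothing. */
int atomic_demo(void)
{
    union semarg arg;
    unsigned short init[2];
    struct sembuf ops[2];
    int semid, rc, saved_errno, val0;

    semid = semget(IPC_PRIVATE, 2, 0600);
    if (semid == -1)
        return -1;

    init[0] = 1;                      /* semaphore 0 can be decremented    */
    init[1] = 0;                      /* semaphore 1 cannot                */
    arg.array = init;
    if (semctl(semid, 0, SETALL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    ops[0].sem_num = 0;  ops[0].sem_op = -1;  ops[0].sem_flg = IPC_NOWAIT;
    ops[1].sem_num = 1;  ops[1].sem_op = -1;  ops[1].sem_flg = IPC_NOWAIT;

    rc = semop(semid, ops, 2);        /* both operations, or neither       */
    saved_errno = errno;
    val0 = semctl(semid, 0, GETVAL);  /* still 1 if nothing was changed    */

    semctl(semid, 0, IPC_RMID);
    return (rc == -1 && saved_errno == EAGAIN && val0 == 1) ? 0 : 1;
}
```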

The ability to undo operations requires the UNIX Operating System to
maintain an array of "undo structures" corresponding to the array of sema-
phore operations to be performed. Each semaphore operation which is to be
undone has an associated adjust variable used for undoing the operation, if
necessary.
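
As a hedged illustration of the undo facility, the sketch below assumes the SEM_UNDO flag described in semop(2) and the usual behavior that the adjustment is applied when the process that requested it terminates. The function name undo_demo and the union name semarg are hypothetical. A child process decrements the semaphore with SEM_UNDO set and exits; the parent then finds the original value restored.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* Caller-supplied semctl() argument; the union name semarg is hypothetical. */
union semarg {
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

/* Hypothetical demonstration: returns 0 if the child's decrement was
 * undone after the child terminated. */
int undo_demo(void)
{
    union semarg arg;
    struct sembuf op;
    pid_t pid;
    int semid, status = 0, waited, val;

    semid = semget(IPC_PRIVATE, 1, 0600);
    if (semid == -1)
        return -1;

    arg.val = 1;
    if (semctl(semid, 0, SETVAL, arg) == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid = fork();
    if (pid == -1) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        /* Child: decrement with SEM_UNDO, then exit without restoring. */
        op.sem_num = 0;
        op.sem_op  = -1;
        op.sem_flg = SEM_UNDO;
        _exit(semop(semid, &op, 1) == 0 ? 0 : 1);
    }

    waited = (waitpid(pid, &status, 0) == pid);
    val = semctl(semid, 0, GETVAL);   /* expected to be 1 again */

    semctl(semid, 0, IPC_RMID);
    return (waited && WIFEXITED(status) && WEXITSTATUS(status) == 0
            && val == 1) ? 0 : 1;
}
```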

Remember, any unsuccessful "nonblocking operation" for a single sema- 
phore or a set of semaphores causes immediate return with no operations per- 
formed at all. When this occurs, a known error code (-1) is returned to the 
process, and the external variable errno is set accordingly. 

System calls make these semaphore capabilities available to processes. 
The calling process passes arguments to a system call, and the system call 
either successfully or unsuccessfully performs its function. If the system call is 
successful, it performs its function and returns the appropriate information. 
Otherwise, a known error code (-1) is returned to the process, and the external 
variable errno is set accordingly. 



Using Semaphores 

Before semaphores can be used (operated on or controlled) a uniquely 
identified data structure and semaphore set (array) must be created. The 
unique identifier is called the semaphore identifier (semid); it is used to iden- 
tify or reference a particular data structure and semaphore set. 

The semaphore set contains a predefined number of structures in an array, 
one structure for each semaphore in the set. The number of semaphores 
(nsems) in a semaphore set is user-selectable. The following members are in 
each structure within a semaphore set: 

■ semaphore text map address 

■ process identification (PID) performing last operation 

■ number of processes awaiting the semaphore value to become greater 
than its current value 

■ number of processes awaiting the semaphore value to equal zero 

There is one associated data structure for the uniquely identified sema- 
phore set. This data structure contains information related to the semaphore 
set as follows:

■ operation permissions data (operation permissions structure)

■ pointer to first semaphore in the set (array) 

■ number of semaphores in the set 

■ last semaphore operation time 

■ last semaphore change time 

The C Programming Language data structure definition for the semaphore
set (array member) is located in the #include <sys/sem.h> header file and is
as follows:



struct sem
{
    ushort  semval;     /* semaphore text map address */
    short   sempid;     /* pid of last operation */
    ushort  semncnt;    /* # awaiting semval > cval */
    ushort  semzcnt;    /* # awaiting semval = 0 */
};



Likewise, the structure definition for the associated semaphore data struc-
ture is also located in the #include <sys/sem.h> header file and is as fol-
lows:

struct semid_ds
{
    struct ipc_perm sem_perm;    /* operation permission struct */
    struct sem      *sem_base;   /* ptr to first semaphore in set */
    ushort          sem_nsems;   /* # of semaphores in set */
    time_t          sem_otime;   /* last semop time */
    time_t          sem_ctime;   /* last change time */
};





Note that the sem_perm member of this structure uses ipc_perm as a
template. The breakout for the operation permissions data structure is shown
in Figure 9-1.

The ipc_perm data structure is the same for all IPC facilities, and it is 
located in the #include <sys/ipc.h> header file. It is shown in the "Mes- 
sages" section. 

The semget(2) system call is used to perform two tasks when only the
IPC_CREAT flag is set in the semflg argument that it receives:

■ to get a new semid and create an associated data structure and sema- 
phore set for it 

■ to return an existing semid that already has an associated data struc- 
ture and semaphore set 

The task performed is determined by the value of the key argument passed to 
the semget(2) system call. For the first task, if the key is not already in use 
for an existing semid, a new semid is returned with an associated data struc- 
ture and semaphore set created for it, provided no system tunable parameter 
would be exceeded. 

There is also a provision for specifying a key of value zero (0), which is
known as the private key (IPC_PRIVATE = 0); when specified, a new semid
is always returned with an associated data structure and semaphore set
created for it, unless a system tunable parameter would be exceeded. When
the ipcs command is performed, the KEY field for the semid is all zeros.
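
A minimal C sketch of the private key behavior follows. The function name private_demo is hypothetical, and current semget(2) and semctl(2) prototypes are assumed. Two calls with IPC_PRIVATE are expected to yield two different semids, since each call creates a new set.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical demonstration: returns 0 if two IPC_PRIVATE requests
 * produced two distinct semaphore set identifiers. */
int private_demo(void)
{
    int first, second, distinct;

    first  = semget(IPC_PRIVATE, 1, 0600);
    second = semget(IPC_PRIVATE, 1, 0600);
    distinct = (first != -1 && second != -1 && first != second);

    /* IPC facilities persist until removed, so remove both sets. */
    if (first != -1)
        semctl(first, 0, IPC_RMID);
    if (second != -1)
        semctl(second, 0, IPC_RMID);

    return distinct ? 0 : 1;
}
```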

When performing the first task, the process which calls semget() becomes
the owner/creator, and the associated data structure is initialized accordingly. 
Remember, ownership can be changed, but the creating process always 
remains the creator; see the "Controlling Semaphores" section in this chapter. 
The creator of the semaphore set also determines the initial operation permis- 
sions for the facility. 

For the second task, if a semid exists for the key specified, the value of
the existing semid is returned. If it is not desired to have an existing semid
returned, a control command (IPC_EXCL) can be specified (set) in the semflg
argument passed to the system call. The system call will fail if it is passed a
value for the number of semaphores (nsems) that is greater than the number
actually in the set; if you do not know how many semaphores are in the set,
use 0 for nsems. The details of using this system call are discussed in the
"Using semget" section of this chapter.

Once a uniquely identified semaphore set and data structure are created, 
semaphore operations [semop(2)] and semaphore control [semctl()] can be 
used. 
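
The nsems rule for an existing set, described above, can be checked with a sketch like the following. The helper names create_under_free_key and nsems_demo and the starting key value are hypothetical, and EINVAL is assumed to be the errno value reported for an nsems value larger than the set. A set of three semaphores is created; asking for five fails, while asking for zero returns the existing semid.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: create a new set of nsems semaphores under the
 * first unused key at or above base. Stores the key in *keyp and returns
 * the semid, or -1 if no set could be created. */
static int create_under_free_key(key_t base, int nsems, key_t *keyp)
{
    int i, semid;

    for (i = 0; i < 100; i++) {
        semid = semget(base + i, nsems, IPC_CREAT | IPC_EXCL | 0600);
        if (semid != -1) {
            *keyp = base + i;
            return semid;
        }
        if (errno != EEXIST)          /* some other failure: give up */
            break;
    }
    return -1;
}

/* Hypothetical demonstration: returns 0 if an oversized nsems failed and
 * an nsems of zero returned the existing semid. */
int nsems_demo(void)
{
    key_t key;
    int created, too_many, saved_errno, any;

    created = create_under_free_key((key_t)0x53454d00, 3, &key);
    if (created == -1)
        return -1;

    too_many = semget(key, 5, 0);     /* more than the 3 in the set */
    saved_errno = errno;
    any = semget(key, 0, 0);          /* number of semaphores unknown */

    semctl(created, 0, IPC_RMID);
    return (too_many == -1 && saved_errno == EINVAL && any == created)
           ? 0 : 1;
}
```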

Semaphore operations consist of incrementing, decrementing, and testing 
for zero. A single system call is used to perform these operations. It is called 
semop(). Refer to the "Operations on Semaphores" section in this chapter
for details of this system call. 

Semaphore control is done by using the semctl(2) system call. These con- 
trol operations permit you to control the semaphore facility in the following 
ways: 

■ to return the value of a semaphore 

■ to set the value of a semaphore 

■ to return the process identification (PID) of the last process performing 
an operation on a semaphore set 

■ to return the number of processes waiting for a semaphore value to 
become greater than its current value 

■ to return the number of processes waiting for a semaphore value to 
equal zero 

■ to get all semaphore values in a set and place them in an array in user
memory

■ to set all semaphore values in a semaphore set from an array of values
in user memory 

■ to place all data structure member values, status, of a semaphore set 
into user memory area 

■ to change operation permissions for a semaphore set 

■ to remove a particular semid from the UNIX Operating System along 
with its associated data structure and semaphore set. 

Refer to the "Controlling Semaphores" section in this chapter for details
of the semctl(2) system call. 

Getting Semaphores 

This section contains a detailed description of using the semget(2) system
call along with an example program illustrating its use.

Using semget 

The synopsis found in the semget(2) entry in the Programmer's Reference
Manual is as follows:




#include  <sys/types.h>
#include  <sys/ipc.h>
#include  <sys/sem.h>

int semget (key, nsems, semflg)
key_t key;
int nsems, semflg;




The following line in the synopsis informs you that semget() is a function 
with three formal arguments that returns an integer type value upon success- 
ful completion (semid). 

int semget (key, nsems, semflg)

The next two lines declare the types of the formal arguments. key_t is
declared by a typedef in the types.h header file to be an integer.

key_t key;

int nsems, semflg;

The integer returned from this system call upon successful completion is 
the semaphore set identifier (semid) that was discussed earlier. 

As declared, the process calling the semget() system call must supply 
three arguments to be passed to the formal key, nsems, and semflg argu- 
ments. 

A new semid with an associated semaphore set and data structure is pro- 
vided if one of the following conditions exists: 

■ key is equal to IPC_PRIVATE 

■ key is passed a unique hexadecimal integer, and semflg ANDed with
IPC_CREAT is TRUE.

The value passed to the semflg argument must be an integer type octal 
value and will specify the following: 

■ access permissions 

■ execution modes 

■ control fields (commands) 

Access permissions determine the read/alter attributes, and execution 
modes determine the user/group/other attributes of the semflg argument. 
They are collectively referred to as "operation permissions." Figure 9-7 
reflects the numeric values (expressed in octal notation) for the valid operation 
permissions codes.

          Operation Permissions    Octal Value

          Read by User                00400
          Alter by User               00200
          Read by Group               00040
          Alter by Group              00020
          Read by Others              00004
          Alter by Others             00002

          Figure 9-7: Operation Permissions Codes



A specific octal value is derived by adding the octal values for the opera- 
tion permissions desired. That is, if read by user and read/alter by others is 
desired, the code value would be 00406 (00400 plus 00006). There are con- 
stants #define'd in the sem.h header file which can be used for the user 
(OWNER). They are: 

SEM_A    0200    /* alter permission by owner */
SEM_R    0400    /* read permission by owner */
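
The addition of octal codes can be written out in C. The macro names and the helper operation_permissions below are hypothetical and are not taken from sem.h; only the octal values come from Figure 9-7. The helper takes one digit per class, each digit being the sum of 4 (read) and 2 (alter).

```c
/* Octal operation permission codes, as listed in Figure 9-7.
 * The macro names are hypothetical. */
#define READ_BY_USER     00400
#define ALTER_BY_USER    00200
#define READ_BY_GROUP    00040
#define ALTER_BY_GROUP   00020
#define READ_BY_OTHERS   00004
#define ALTER_BY_OTHERS  00002

/* Hypothetical helper: build an operation permissions code from one
 * digit per class, where each digit is 4 (read) plus 2 (alter).
 * Bits other than read and alter are ignored. */
int operation_permissions(int user, int group, int others)
{
    return ((user & 06) << 6) | ((group & 06) << 3) | (others & 06);
}
```

For the case in the text, read by user and read/alter by others, operation_permissions(4, 0, 6) and the expression READ_BY_USER | READ_BY_OTHERS | ALTER_BY_OTHERS both give 00406.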

Control commands are predefined constants (represented by all uppercase 
letters). Figure 9-8 contains the names of the constants which apply to the 
semget(2) system call along with their values. They are also referred to as 
flags and are defined in the ipc.h header file. 



          Control Command      Value

          IPC_CREAT           0001000
          IPC_EXCL            0002000

          Figure 9-8: Control Commands (Flags)

The value for the semflg argument is, therefore, a combination of opera-
tion permissions and control commands. After determining the value for the
operation permissions as previously described, the desired flag(s) can be speci-
fied. This specification is accomplished by bitwise ORing (|) them with the
operation permissions; bit positions and values for the control commands in
relation to those of the operation permissions make this possible. An example
of determining the semflg argument follows.

                                  Octal Value      Binary Value

          IPC_CREAT             =    01000      000 001 000 000 000
          ORed by Read by User  =    00400      000 000 100 000 000

          semflg                =    01400      000 001 100 000 000



The semflg value can easily be set by using the names of the flags in con- 
junction with the octal operation permissions value: 

semid = semget (key, nsems, (IPC_CREAT | 0400));

semid = semget (key, nsems, (IPC_CREAT | IPC_EXCL | 0400));

As specified by the semget(2) entry in the Programmer's Reference Manual, 
success or failure of this system call depends upon the actual argument values 
for key, nsems, semflg or system tunable parameters. The system call will 
attempt to return a new semid if one of the following conditions exists: 

■ key is equal to IPC_PRIVATE (0)

■ key does not already have a semid associated with it, and (semflg & 
IPC_CREAT) is TRUE (not zero). 

The key argument can be set to IPC_PRIVATE in the following ways:

semid = semget (IPC_PRIVATE, nsems, semflg);

or

semid = semget ( 0, nsems, semflg);

This alone will cause the system call to be attempted because it satisfies the
first condition specified. Exceeding the SEMMNI, SEMMNS, or SEMMSL
system-tunable parameters will always cause a failure. The SEMMNI system-
tunable parameter determines the maximum number of unique semaphore sets
(semid's) in the UNIX Operating System. The SEMMNS system-tunable
parameter determines the maximum number of semaphores in all semaphore
sets systemwide. The SEMMSL system-tunable parameter determines the
maximum number of semaphores in each semaphore set.

The second condition is satisfied if the value for key is not already associ-
ated with a semid and the bitwise ANDing of semflg and IPC_CREAT is
TRUE (not zero). This means that the key is unique (not in use) within the
UNIX Operating System for this facility type and that the IPC_CREAT flag is
set (semflg | IPC_CREAT). The bitwise ANDing (&), which is the logical way
of testing if a flag is set, is illustrated as follows:

semflg       =  x1xxx    (x = immaterial)
& IPC_CREAT  =  01000

result       =  01000    (not zero)

Since the result is not zero, the flag is set or TRUE. SEMMNI, SEMMNS, and 
SEMMSL apply here also, just as for condition one. 
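
The same test can be expressed as a small C sketch. The function names flag_is_set and permission_bits are hypothetical; IPC_CREAT and IPC_EXCL come from the ipc.h header file.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Hypothetical helper: nonzero if flag is set in semflg, found by
 * bitwise ANDing the two values. */
int flag_is_set(int semflg, int flag)
{
    return (semflg & flag) != 0;
}

/* Hypothetical helper: the operation permissions part of semflg, which
 * occupies the low-order nine bits. */
int permission_bits(int semflg)
{
    return semflg & 0777;
}
```

With semflg set to (IPC_CREAT | 0400), flag_is_set(semflg, IPC_CREAT) is true, flag_is_set(semflg, IPC_EXCL) is false, and permission_bits(semflg) is 0400.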

IPC_EXCL is another control command used in conjunction with
IPC_CREAT to have the system call fail if, and only if, a semid
exists for the specified key. This is necessary to prevent the process
from thinking that it has received a new (unique) semid when it has not. In
other words, when both IPC_CREAT and IPC_EXCL are specified, a new
semid is returned if the system call is successful. Any value for semflg
returns a new semid if the key equals zero (IPC_PRIVATE) and no system-
tunable parameters are exceeded.
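
The effect of IPC_EXCL can be sketched as follows. The helper names and the starting key value are hypothetical, and EEXIST is assumed to be the errno value reported when the key already has a semid. Once a set exists for a key, a call with IPC_CREAT alone returns the existing semid, while a call with IPC_CREAT and IPC_EXCL fails.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: create a new one-semaphore set under the first
 * unused key at or above base. Stores the key in *keyp and returns the
 * semid, or -1 if no set could be created. */
static int create_exclusive(key_t base, key_t *keyp)
{
    int i, semid;

    for (i = 0; i < 100; i++) {
        semid = semget(base + i, 1, IPC_CREAT | IPC_EXCL | 0600);
        if (semid != -1) {
            *keyp = base + i;
            return semid;
        }
        if (errno != EEXIST)          /* some other failure: give up */
            break;
    }
    return -1;
}

/* Hypothetical demonstration: returns 0 if IPC_CREAT alone returned the
 * existing semid and IPC_CREAT with IPC_EXCL failed for the same key. */
int exclusive_demo(void)
{
    key_t key;
    int first, plain, second, saved_errno;

    first = create_exclusive((key_t)0x53454d00, &key);
    if (first == -1)
        return -1;

    plain  = semget(key, 1, IPC_CREAT | 0600);             /* existing   */
    second = semget(key, 1, IPC_CREAT | IPC_EXCL | 0600);  /* must fail  */
    saved_errno = errno;

    semctl(first, 0, IPC_RMID);
    return (plain == first && second == -1 && saved_errno == EEXIST)
           ? 0 : 1;
}
```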

Refer to the semget(2) manual page for specific, associated data structure 
initialization for successful completion. 

Example Program 

The example program in this section (Figure 9-9) is a menu-driven pro- 
gram which allows all possible combinations of using the semget(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 4-8) by including the required header files as 
specified by the semget(2) entry in the Programmer's Reference Manual. Note
that the errno.h header file is included as opposed to declaring errno as an
external variable; either method will work.

Variable names have been chosen to be as close as possible to those in the
synopsis for the system call. Their declarations are self-explanatory. These 
names make the program more readable, and are perfectly legal since they are 
local to the program. Variables declared for this program and their purpose 
are as follows: 

■ key — is used to pass the value for the desired key. 

■ opperm — is used to store the desired operation permissions. 

■ flags — is used to store the desired control commands (flags). 

■ opperm_flags — is used to store the combination from the logical
ORing of the opperm and flags variables; it is then used in the system 
call to pass the semflg argument. 

■ semid — is used for returning the semaphore set identification number 
for a successful system call or the error code (-1) for an unsuccessful 
one. 

The program begins by prompting for a hexadecimal key, an octal opera-
tion permissions code, and the control command combinations (flags) which
are selected from a menu (lines 15-32). All possible combinations are allowed 
even though they might not be viable. This allows observing the errors for 
illegal combinations. 

Next, the menu selection for the flags is combined with the operation per-
missions, and the result is stored at the address of the opperm_flags variable
(lines 36-52).

Then, the number of semaphores for the set is requested (lines 53-57), and 
its value is stored at the address of nsems. 

The system call is made next, and the result is stored at the address of the 
semid variable (lines 60 and 61). 

Since the semid variable now contains a valid semaphore set identifier or 
the error code (-1), it is tested to see if an error occurred (line 63). If semid 
equals -1, a message indicates that an error resulted and the external errno 
variable is displayed (lines 65 and 66). Remember that the external errno 
variable is only set when a system call fails; it should only be tested immedi-
ately following system calls.

If no error occurred, the returned semaphore set identifier is displayed
(line 70).
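
The advice about testing errno immediately can be illustrated with a short sketch. The function name private_semget_errno is hypothetical, and the use of strerror(3) and the choice of a negative nsems to provoke a failure are assumptions of this example, not part of the program in Figure 9-9. The value of errno is copied before any other library call is made, because a later call may change it.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: request a private set of nsems semaphores.
 * Returns 0 on success (the new set is removed again) or the errno
 * value saved immediately after the failing call. */
int private_semget_errno(int nsems)
{
    int semid, saved_errno;

    semid = semget(IPC_PRIVATE, nsems, 0600);
    if (semid == -1) {
        saved_errno = errno;     /* save before any other library call */
        fprintf(stderr, "semget failed: %s (errno = %d)\n",
                strerror(saved_errno), saved_errno);
        return saved_errno;
    }

    semctl(semid, 0, IPC_RMID);
    return 0;
}
```

A call such as private_semget_errno(-1) is expected to return EINVAL, since a negative number of semaphores is not a valid request.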

The example program for the semget(2) system call follows. It is sug- 
gested that the source program file be named semget.c and that the executable 
file be named semget. 



 1    /*This is a program to illustrate
 2    **the semaphore get, semget(),
 3    **system call capabilities.*/

 4    #include  <stdio.h>
 5    #include  <sys/types.h>
 6    #include  <sys/ipc.h>
 7    #include  <sys/sem.h>
 8    #include  <errno.h>

 9    /*Start of main C language program*/
10    main()
11    {
12        key_t key;     /*declare as long integer*/
13        int opperm, flags, nsems;
14        int semid, opperm_flags;

15        /*Enter the desired key*/
16        printf("\nEnter the desired key in hex = ");
17        scanf("%x", &key);

18        /*Enter the desired octal operation
19          permissions.*/
20        printf("\nEnter the operation\n");
21        printf("permissions in octal = ");
22        scanf("%o", &opperm);

Figure 9-9: semget() System Call Example (Sheet 1 of 3)

23        /*Set the desired flags.*/
24        printf("\nEnter corresponding number to\n");
25        printf("set the desired flags:\n");
26        printf("No flags                  = 0\n");
27        printf("IPC_CREAT                 = 1\n");
28        printf("IPC_EXCL                  = 2\n");
29        printf("IPC_CREAT and IPC_EXCL    = 3\n");
30        printf("            Flags         = ");
31        /*Get the flags to be set.*/
32        scanf("%d", &flags);

33        /*Error checking (debugging)*/
34        printf("\nkey =0x%x, opperm = 0%o, flags = 0%o\n",
35            key, opperm, flags);
36        /*Incorporate the control fields (flags) with
37          the operation permissions.*/
38        switch (flags)
39        {
40        case 0:    /*No flags are to be set.*/
41            opperm_flags = (opperm | 0);
42            break;
43        case 1:    /*Set the IPC_CREAT flag.*/
44            opperm_flags = (opperm | IPC_CREAT);
45            break;
46        case 2:    /*Set the IPC_EXCL flag.*/
47            opperm_flags = (opperm | IPC_EXCL);
48            break;
49        case 3:    /*Set the IPC_CREAT and IPC_EXCL
50                     flags.*/
51            opperm_flags = (opperm | IPC_CREAT | IPC_EXCL);
52        }

Figure 9-9: semget() System Call Example (Sheet 2 of 3)

53        /*Get the number of semaphores for this set.*/
54        printf("\nEnter the number of\n");
55        printf("desired semaphores for\n");
56        printf("this set (25 max) = ");
57        scanf("%d", &nsems);

58        /*Check the entry.*/
59        printf("\nNsems = %d\n", nsems);

60        /*Call the semget system call.*/
61        semid = semget(key, nsems, opperm_flags);

62        /*Perform the following if the call is unsuccessful.*/
63        if(semid == -1)
64        {
65            printf("The semget system call failed!\n");
66            printf("The error number = %d\n", errno);
67        }
68        /*Return the semid upon successful completion.*/
69        else
70            printf("\nThe semid = %d\n", semid);
71        exit(0);
72    }




Figure 9-9: semget() System Call Example (Sheet 3 of 3) 



Controlling Semaphores 

This section contains a detailed description of using the semctl(2) system 
call along with an example program which allows all of its capabilities to be 
exercised.

Using semctl

The synopsis found in the semctl(2) entry in the Programmer's Reference 
Manual is as follows: 



#include  <sys/types.h>
#include  <sys/ipc.h>
#include  <sys/sem.h>

int semctl (semid, semnum, cmd, arg)
int semid, cmd;
int semnum;
union semun
{
     int val;
     struct semid_ds *buf;
     ushort array[];
} arg;

The semctl(2) system call requires four arguments to be passed to it, and it
returns an integer value.

The semid argument must be a valid, non-negative, integer value that has
already been created by using the semget(2) system call.

The semnum argument is used to select a semaphore by its number. This
relates to array (atomically performed) operations on the set. When a set of
semaphores is created, the first semaphore is number 0, and the last sema-
phore has the number of one less than the total in the set.

The cmd argument can be replaced by one of the following control com-
mands (flags):

■ GETVAL — returns the value of a single semaphore within a sema-
phore set.

■ SETVAL — sets the value of a single semaphore within a semaphore
set.

■ GETPID — returns the Process Identifier (PID) of the process that per-
formed the last operation on the semaphore within a semaphore set.

■ GETNCNT — returns the number of processes waiting for the value of 
a particular semaphore to become greater than its current value. 

■ GETZCNT — returns the number of processes waiting for the value of a 
particular semaphore to be equal to zero. 

■ GETALL — returns the values for all semaphores in a semaphore set. 

■ SETALL — sets all semaphore values in a semaphore set. 

■ IPC_STAT — returns the status information contained in the associated 
data structure for the specified semid, and places it in the data structure 
pointed to by the *buf pointer in the user memory area; arg.buf is the 
union member that contains the value of buf . 

■ IPC_SET — for the specified semaphore set (semid), sets the effective
user/group identification and operation permissions.

■ IPC_RMID — removes the specified (semid) semaphore set along with 
its associated data structure. 



A process must have an effective user identification of
OWNER/CREATOR or super-user to perform an IPC_SET or IPC_RMID con-
trol command. Read/alter permission is required as applicable for the other
control commands.

The arg argument is used to pass the system call the appropriate union 
member for the control command to be performed: 

■ arg.val 

■ arg.buf 

■ arg.array 
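
How the union members pair with control commands can be sketched as follows. This is an illustration under stated assumptions: on many current systems the program must declare the union itself, so a union with the hypothetical name semarg is declared here, and its array member is written as a pointer. The function name control_demo is also hypothetical. The sketch uses arg.val with SETVAL and arg.buf with IPC_STAT.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Caller-supplied semctl() argument; the union name semarg is hypothetical. */
union semarg {
    int              val;     /* value for SETVAL                  */
    struct semid_ds *buf;     /* buffer for IPC_STAT and IPC_SET   */
    unsigned short  *array;   /* array for GETALL and SETALL       */
};

/* Hypothetical demonstration on a private set of three semaphores.
 * Returns 0 if SETVAL, GETVAL, and IPC_STAT all behaved as expected. */
int control_demo(void)
{
    union semarg arg;
    struct semid_ds ds;
    int semid, val, ok;

    semid = semget(IPC_PRIVATE, 3, 0640);
    if (semid == -1)
        return -1;

    arg.val = 7;                                  /* arg.val for SETVAL   */
    ok = semctl(semid, 1, SETVAL, arg) == 0;
    val = semctl(semid, 1, GETVAL);               /* value is returned    */

    arg.buf = &ds;                                /* arg.buf for IPC_STAT */
    ok = ok && semctl(semid, 0, IPC_STAT, arg) == 0;
    ok = ok && val == 7
            && ds.sem_nsems == 3
            && (ds.sem_perm.mode & 0777) == 0640;

    semctl(semid, 0, IPC_RMID);                   /* remove the set       */
    return ok ? 0 : 1;
}
```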



The details of this system call are discussed in the example program for it. 
If you have problems understanding the logic manipulations in this program, 
read the "Using semget" section of this chapter; it goes into more detail than 
would be practical to do for every system call.

Example Program

The example program in this section (Figure 9-10) is a menu-driven pro-
gram which allows all possible combinations of using the semctl(2) system
call to be exercised.

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 5-9) by including the required header files as 
specified by the semctl(2) entry in the Programmer's Reference Manual. Note 
that in this program errno is declared as an external variable, and therefore
the errno.h header file does not have to be included. 

Variable, structure, and union names have been chosen to be as close as 
possible to those in the synopsis for the system call. Their declarations are 
self-explanatory. These names make the program more readable, and are per- 
fectly legal since they are local to the program. Variables declared for this 
program and their purpose are as follows: 

■ semid_ds — is used to receive the specified semaphore set identifier's
data structure when an IPC_STAT control command is performed.

■ c — is used to receive the input values from the scanf(3S) function 
(line 117) when performing a SETALL control command. 

■ i — is used as a counter to increment through the union arg.array when 
displaying the semaphore values for a GETALL (lines 97-99) control 
command, and when initializing the arg.array when performing a 
SETALL (lines 115-119) control command. 

■ length — is used as a variable to test for the number of semaphores in 
a set against the i counter variable (lines 97 and 115). 

■ uid — is used to store the IPC_SET value for the effective user identifi- 
cation. 

■ gid — is used to store the IPC_SET value for the effective group iden- 
tification. 

■ mode — is used to store the IPC_SET value for the operation permis-
sions.

■ rtrn — is used to store the return integer from the system call which
depends upon the control command or a -1 when unsuccessful.

■ semid — is used to store and pass the semaphore set identifier to the 
system call. 

■ semnum — is used to store and pass the semaphore number to the sys- 
tem call. 

■ cmd — is used to store the code for the desired control command so 
that subsequent processing can be performed on it. 

■ choice — is used to determine which member (uid, gid, mode) for the 
IPC_SET control command is to be changed.

■ arg.val — is used to pass the system call a value to set (SETVAL) or to 
store (GETVAL) a value returned from the system call for a single 
semaphore (union member). 

■ arg.buf — is a pointer passed to the system call which locates the data 
structure in the user memory area where the IPC_STAT control com-
mand is to place its return values, or where the IPC_SET command
gets the values to set (union member). 

■ arg.array — is used to store the set of semaphore values when getting 
(GETALL) or initializing (SETALL) (union member). 

Note that the semid_ds data structure in this program (line 14) uses the 
data structure located in the sem.h header file of the same name as a template 
for its declaration. This is a perfect example of the advantage of local vari- 
ables. 

The arg union (lines 18-22) serves three purposes in one. The compiler 
allocates enough storage to hold its largest member. The program can then 
use the union as any member by referencing union members as if they were 
regular structure members. Note that the array is declared to have 25 ele-
ments (0 through 24). This number corresponds to the maximum number of
semaphores allowed per set (SEMMSL), a system tunable parameter. 

The next important program aspect to observe is that, although the *buf 
pointer member (arg.buf) of the union is declared to be a pointer to a data 
structure of the semid_ds type, it must also be initialized to contain the
address of the user memory area data structure (line 24). Because of the way
this program is written, the pointer does not need to be reinitialized later.
If it were used to increment through the array, it would need to be reinitialized
just before calling the system call.

Now that all of the required declarations have been presented for this pro- 
gram, this is how it works. 

First, the program prompts for a valid semaphore set identifier, which is 
stored at the address of the semid variable (lines 25-27). This is required for 
all semctl(2) system calls. 

Then, the code for the desired control command must be entered (lines 
28-42), and the code is stored at the address of the cmd variable. The code is 
tested to determine the control command for subsequent processing. 

If the GETVAL control command is selected (code 1), a message prompt- 
ing for a semaphore number is displayed (lines 49 and 50). When it is 
entered, it is stored at the address of the semnum variable (line 51). Then, 
the system call is performed, and the semaphore value is displayed (lines 52- 
55). If the system call is successful, a message indicates this along with the 
semaphore set identifier used (lines 195, 196); if the system call is unsuccess- 
ful, an error message is displayed along with the value of the external errno 
variable (lines 191-193). 

If the SETVAL control command is selected (code 2), a message prompting 
for a semaphore number is displayed (lines 56 and 57). When it is entered, it 
is stored at the address of the semnum variable (line 58). Next, a message 
prompts for the value to which the semaphore is to be set, and it is stored as 
the arg.val member of the union (lines 59 and 60). Then, the system call is 
performed (lines 61, 63). Depending upon success or failure, the program 
returns the same messages as for GETVAL above. 

If the GETPID control command is selected (code 3), the system call is 
made immediately since all required arguments are known (lines 64-67), and 
the PID of the process performing the last operation is displayed. Depending 
upon success or failure, the program returns the same messages as for GET- 
VAL above. 

If the GETNCNT control command is selected (code 4), a message 
prompting for a semaphore number is displayed (lines 68-72). When entered, 
it is stored at the address of the semnum variable (line 73). Then, the system 
call is performed, and the number of processes waiting for the semaphore to 
become greater than its current value is displayed (lines 74-77). Depending 
upon success or failure, the program returns the same messages as for GET-
VAL above.

If the GETZCNT control command is selected (code 5), a message prompt-
ing for a semaphore number is displayed (lines 78-81). When it is entered, it
is stored at the address of the semnum variable (line 82). Then the system 
call is performed, and the number of processes waiting for the semaphore 
value to become equal to zero is displayed (lines 83-86). Depending upon 
success or failure, the program returns the same messages as for GETVAL 
above. 

If the GETALL control command is selected (code 6), the program first 
performs an IPC_STAT control command to determine the number of sema-
phores in the set (lines 88-93). The length variable is set to the number of
semaphores in the set (line 91). Next, the system call is made and, upon suc- 
cess, the arg.array union member contains the values of the semaphore set 
(line 96). Now, a loop is entered which displays each element of the 
arg.array from zero to one less than the value of length (lines 97-103). The
semaphores in the set are displayed on a single line, separated by a space. 
Depending upon success or failure, the program returns the same messages as 
for GETVAL above. 

If the SETALL control command is selected (code 7), the program first 
performs an IPC_STAT control command to determine the number of sema-
phores in the set (lines 106-108). The length variable is set to the number of
semaphores in the set (line 109). Next, the program prompts for the values to 
be set and enters a loop which takes values from the keyboard and initializes 
the arg.array union member to contain the desired values of the semaphore 
set (lines 113-119). The loop puts the first entry into the array position for 
semaphore number zero and ends when the semaphore number that is filled 
in the array equals one less than the value of length. The system call is then 
made (lines 120-122). Depending upon success or failure, the program returns 
the same messages as for GETVAL above. 

If the IPC_STAT control command is selected (code 8), the system call is
performed (line 127), and the status information returned is printed out (lines
128-139); only the members that can be set are printed out in this program.
Note that if the system call is unsuccessful, the status information of the last
successful one is printed out. In addition, an error message is displayed, and
the errno variable is printed out (lines 191 and 192).

If the IPC_SET control command is selected (code 9), the program gets
the current status information for the semaphore set identifier specified (lines
143-146). This is necessary because this example program provides for
changing only one member at a time, and the semctl(2) system call changes all of
them. Also, if an invalid value happened to be stored in the user memory
area for one of these members, it would cause repetitive failures for this
control command until corrected. The next thing the program does is to prompt
for a code corresponding to the member to be changed (lines 147-153). This
code is stored at the address of the choice variable (line 154). Now,
depending upon the member picked, the program prompts for the new value (lines
155-178). The value is placed at the address of the appropriate member in the
user memory area data structure, and the system call is made (line 181).
Depending upon success or failure, the program returns the same messages as
for GETVAL above.
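The reason for fetching the current status before an IPC_SET can be shown in a few lines. The sketch below is an assumption-laden illustration, not the example program's code: sem_set_mode and the union tag semarg are hypothetical names. It reads the whole structure with IPC_STAT, changes only the permission bits, and writes the structure back, so the owner and group members keep their current values.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Caller-declared argument union for semctl(); "semarg" is an
   assumed tag name. */
union semarg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Change only the operation permissions of a semaphore set.
   Returns 0 on success, -1 on failure. */
int sem_set_mode(int semid, int mode)
{
    struct semid_ds ds;
    union semarg arg;

    arg.buf = &ds;
    /* Fetch the current members so uid and gid are preserved. */
    if (semctl(semid, 0, IPC_STAT, arg) == -1)
        return -1;
    /* Replace the low nine permission bits and nothing else. */
    ds.sem_perm.mode = (ds.sem_perm.mode & ~0777) | (mode & 0777);
    return semctl(semid, 0, IPC_SET, arg);
}
```

The new mode should keep read permission for the owner; otherwise a later IPC_STAT by the same unprivileged process would be refused.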

If the IPC_RMID control command (code 10) is selected, the system call is
performed (lines 183-185). The semid along with its associated data structure
and semaphore set is removed from the UNIX Operating System. Depending
upon success or failure, the program returns the same messages as for the
other control commands.

The example program for the semctl(2) system call follows. It is
suggested that the source program file be named semctl.c and that the executable
file be named semctl.

  1  /*This is a program to illustrate
  2  **the semaphore control, semctl(),
  3  **system call capabilities.
  4  */
  5  /*Include necessary header files.*/
  6  #include <stdio.h>
  7  #include <sys/types.h>
  8  #include <sys/ipc.h>
  9  #include <sys/sem.h>
 10  /*Start of main C language program*/
 11  main()
 12  {
 13      extern int errno;
 14      struct semid_ds semid_ds;
 15      int c, i, length;
 16      int uid, gid, mode;
 17      int retrn, semid, semnum, cmd, choice;
 18      union semun {
 19          int val;
 20          struct semid_ds *buf;
 21          ushort array[25];
 22      } arg;
 23      /*Initialize the data structure pointer.*/
 24      arg.buf = &semid_ds;
 25      /*Enter the semaphore ID.*/
 26      printf("Enter the semid = ");
 27      scanf("%d", &semid);
 28      /*Choose the desired command.*/
 29      printf("\nEnter the number for\n");
 30      printf("the desired cmd:\n");
 31      printf("GETVAL     =  1\n");
 32      printf("SETVAL     =  2\n");
 33      printf("GETPID     =  3\n");
 34      printf("GETNCNT    =  4\n");
 35      printf("GETZCNT    =  5\n");
 36      printf("GETALL     =  6\n");
 37      printf("SETALL     =  7\n");
 38      printf("IPC_STAT   =  8\n");
 39      printf("IPC_SET    =  9\n");
 40      printf("IPC_RMID   = 10\n");
 41      printf("Entry      = ");
 42      scanf("%d", &cmd);
 43      /*Check entries.*/
 44      printf("\nsemid =%d, cmd = %d\n\n",
 45          semid, cmd);
 46      /*Set the command and do the call.*/
 47      switch (cmd)
 48      {
 49      case 1: /*Get a specified value.*/
 50          printf("\nEnter the semnum = ");
 51          scanf("%d", &semnum);
 52          /*Do the system call.*/
 53          retrn = semctl(semid, semnum, GETVAL, 0);
 54          printf("\nThe semval = %d\n", retrn);
 55          break;
 56      case 2: /*Set a specified value.*/
 57          printf("\nEnter the semnum = ");
 58          scanf("%d", &semnum);
 59          printf("\nEnter the value = ");
 60          scanf("%d", &arg.val);
 61          /*Do the system call.*/
 62          retrn = semctl(semid, semnum, SETVAL, arg.val);
 63          break;
 64      case 3: /*Get the process ID.*/
 65          retrn = semctl(semid, 0, GETPID, 0);
 66          printf("\nThe sempid = %d\n", retrn);
 67          break;
 68      case 4: /*Get the number of processes
 69                waiting for the semaphore to
 70                become greater than its current
 71                value.*/
 72          printf("\nEnter the semnum = ");
 73          scanf("%d", &semnum);
 74          /*Do the system call.*/
 75          retrn = semctl(semid, semnum, GETNCNT, 0);
 76          printf("\nThe semncnt = %d", retrn);
 77          break;
 78      case 5: /*Get the number of processes
 79                waiting for the semaphore
 80                value to become zero.*/
 81          printf("\nEnter the semnum = ");
 82          scanf("%d", &semnum);
 83          /*Do the system call.*/
 84          retrn = semctl(semid, semnum, GETZCNT, 0);
 85          printf("\nThe semzcnt = %d", retrn);
 86          break;
 87      case 6: /*Get all of the semaphores.*/
 88          /*Get the number of semaphores in
 89            the semaphore set.*/
 90          retrn = semctl(semid, 0, IPC_STAT, arg.buf);
 91          length = arg.buf->sem_nsems;
 92          if (retrn == -1)
 93              goto ERROR;
 94          /*Get and print all semaphores in the
 95            specified set.*/
 96          retrn = semctl(semid, 0, GETALL, arg.array);
 97          for (i = 0; i < length; i++)
 98          {
 99              printf("%d", arg.array[i]);
100              /*Separate each
101                semaphore.*/
102              printf("%c", ' ');
103          }
104          break;
105      case 7: /*Set all semaphores in the set.*/
106          /*Get the number of semaphores in
107            the set.*/
108          retrn = semctl(semid, 0, IPC_STAT, arg.buf);
109          length = arg.buf->sem_nsems;
110          printf("Length = %d\n", length);
111          if (retrn == -1)
112              goto ERROR;
113          /*Set the semaphore set values.*/
114          printf("\nEnter each value:\n");
115          for (i = 0; i < length; i++)
116          {
117              scanf("%d", &c);
118              arg.array[i] = c;
119          }
120          /*Do the system call.*/
121          retrn = semctl(semid, 0, SETALL, arg.array);
122          break;
123      case 8: /*Get the status for the
124                semaphore set.*/
125          /*Get and print the current status
126            values.*/
127          retrn = semctl(semid, 0, IPC_STAT, arg.buf);
128          printf("\nThe USER ID = %d\n",
129              arg.buf->sem_perm.uid);
130          printf("The GROUP ID = %d\n",
131              arg.buf->sem_perm.gid);
132          printf("The operation permissions = 0%o\n",
133              arg.buf->sem_perm.mode);
134          printf("The number of semaphores in set = %d\n",
135              arg.buf->sem_nsems);
136          printf("The last semop time = %d\n",
137              arg.buf->sem_otime);
138          printf("The last change time = %d\n",
139              arg.buf->sem_ctime);
140          break;
141      case 9: /*Select and change the desired
142                member of the data structure.*/
143          /*Get the current status values.*/
144          retrn = semctl(semid, 0, IPC_STAT, arg.buf);
145          if (retrn == -1)
146              goto ERROR;
147          /*Select the member to change.*/
148          printf("\nEnter the number for the\n");
149          printf("member to be changed:\n");
150          printf("sem_perm.uid   = 1\n");
151          printf("sem_perm.gid   = 2\n");
152          printf("sem_perm.mode  = 3\n");
153          printf("Entry          = ");
154          scanf("%d", &choice);
155          switch(choice){
156          case 1: /*Change the user ID.*/
157              printf("\nEnter USER ID = ");
158              scanf("%d", &uid);
159              arg.buf->sem_perm.uid = uid;
160              printf("\nUSER ID = %d\n",
161                  arg.buf->sem_perm.uid);
162              break;
163          case 2: /*Change the group ID.*/
164              printf("\nEnter GROUP ID = ");
165              scanf("%d", &gid);
166              arg.buf->sem_perm.gid = gid;
167              printf("\nGROUP ID = %d\n",
168                  arg.buf->sem_perm.gid);
169              break;
170          case 3: /*Change the mode portion of
171                    the operation
172                    permissions.*/
173              printf("\nEnter MODE = ");
174              scanf("%o", &mode);
175              arg.buf->sem_perm.mode = mode;
176              printf("\nMODE = 0%o\n",
177                  arg.buf->sem_perm.mode);
178              break;
179          }
180          /*Do the change.*/
181          retrn = semctl(semid, 0, IPC_SET, arg.buf);
182          break;
183      case 10: /*Remove the semid along with its
184                 data structure.*/
185          retrn = semctl(semid, 0, IPC_RMID, 0);
186      }
187      /*Perform the following if the call is unsuccessful.*/
188      if (retrn == -1)
189      {
190  ERROR:
191          printf("\n\nThe semctl system call failed!\n");
192          printf("The error number = %d\n", errno);
193          exit(0);
194      }
195      printf("\n\nThe semctl system call was successful\n");
196      printf("for semid = %d\n", semid);
197      exit(0);
198  }

Figure 9-10: semctl() System Call Example

Operations on Semaphores

This section contains a detailed description of using the semop(2) system 
call along with an example program which allows all of its capabilities to be 
exercised. 

Using semop 

The synopsis found in the semop(2) entry in the Programmer's Reference 
Manual is as follows: 



#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

int semop (semid, sops, nsops)
int semid;
struct sembuf **sops;
unsigned nsops;

The semop(2) system call requires three arguments to be passed to it, and 
it returns an integer value. Upon successful completion, a zero value is 
returned. When unsuccessful, a -1 is returned. 

The semid argument must be a valid, non-negative, integer value; that is, 
it must have already been created by using the semget(2) system call. 

The sops argument is a pointer to an array of structures in the user 
memory area that contains the following for each semaphore to be changed: 

■ the semaphore number 

■ the operation to be performed 

■ the control command (flags) 






The **sops declaration means that a pointer can be initialized to the
address of the array, or the array name can be used since it is the address of
the first element of the array. sembuf is the tag name of the data structure
used as the template for the structure members in the array; it is located in the
<sys/sem.h> header file.

The nsops argument specifies the length of the array (the number of struc- 
tures in the array). The maximum size of this array is determined by the 
SEMOPM system tunable parameter. Therefore, a maximum of SEMOPM 
operations can be performed for each semop(2) system call. 

The semaphore number determines the particular semaphore within the 
set on which the operation is to be performed. 

The operation to be performed is determined by the following: 

■ a positive integer value means to increment the semaphore value by its 
value 

■ a negative integer value means to decrement the semaphore value by 
its value 

■ a value of zero means to test if the semaphore is equal to zero 

The following operation commands (flags) can be used: 

■ IPC_NOWAIT — this operation command can be set for any operations
in the array. The system call will return unsuccessfully without changing
any semaphore values at all if any operation for which
IPC_NOWAIT is set cannot be performed successfully. The system call
will be unsuccessful when trying to decrement a semaphore more than
its current value, or when testing for a semaphore to be equal to zero
when it is not.

■ SEM_UNDO — this operation command allows any operations in the 
array to be undone when any operation in the array is unsuccessful and 
does not have the IPC_NOWAIT flag set. That is, the blocked opera- 
tion waits until it can perform its operation; and when it and all 
succeeding operations are successful, all operations with the 
SEM_UNDO flag set are undone. Remember, no operations are per- 
formed on any semaphores in a set until all operations are successful. 
Undoing is accomplished by using an array of adjust values for the 
operations that are to be undone when the blocked operation and all 
subsequent operations are successful. 
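The operation values and the IPC_NOWAIT flag described above can be seen together in a short sketch. The helper names sem_try_acquire and sem_release are hypothetical. The sketch assumes that a refused non-blocking operation reports EAGAIN in errno, which is the behavior on current systems; a single sembuf structure is passed, so nsops is 1.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Try to decrement semaphore semnum by one without blocking.
   Returns 0 on success, 1 if the decrement would have to wait
   (the semaphore is already zero), and -1 on any other error. */
int sem_try_acquire(int semid, unsigned short semnum)
{
    struct sembuf op;

    op.sem_num = semnum;        /* which semaphore in the set */
    op.sem_op = -1;             /* negative value: decrement */
    op.sem_flg = IPC_NOWAIT;    /* return instead of waiting */
    if (semop(semid, &op, 1) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}

/* Increment semaphore semnum by one; an increment never waits. */
int sem_release(int semid, unsigned short semnum)
{
    struct sembuf op;

    op.sem_num = semnum;
    op.sem_op = 1;              /* positive value: increment */
    op.sem_flg = 0;
    return semop(semid, &op, 1);
}
```

With the semaphore value set to 1, the first sem_try_acquire call succeeds and a second call returns 1 until sem_release is called.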
Example Program

The example program in this section (Figure 9-11) is a menu-driven pro- 
gram which allows all possible combinations of using the semop(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 5-9) by including the required header files as
specified by the semop(2) entry in the Programmer's Reference Manual. Note
that in this program errno is declared as an external variable, and therefore,
the errno.h header file does not have to be included.

Variable and structure names have been chosen to be as close as possible 
to those in the synopsis for the system call. Their declarations are self- 
explanatory. These names make the program more readable, and are perfectly 
legal since they are local to the program. Variables declared for this program 
and their purpose are as follows: 

■ sembuf[10] — is used as an array buffer (line 14) to contain a maximum 
of ten sembuf type structures; ten equals SEMOPM, the maximum 
number of operations on a semaphore set for each semop(2) system 
call. 

■ *sops — is used as a pointer (line 14) to sembuf[10] for the system call
and for accessing the structure members within the array.

■ retrn — is used to store the return values from the system call.

■ flags — is used to store the code of the IPC_NOWAIT or SEM_UNDO
flags for the semop(2) system call (line 60).

■ i — is used as a counter (line 32) for initializing the structure members 
in the array, and used to print out each structure in the array (line 79). 

■ nsops — is used to specify the number of semaphore operations for the 
system call — must be less than or equal to SEMOPM. 

■ semid — is used to store the desired semaphore set identifier for the 
system call. 
First, the program prompts for a semaphore set identifier that the system
call is to perform operations on (lines 19-22). The identifier is stored at the
address of the semid variable (line 23).

A message is displayed requesting the number of operations to be per- 
formed on this set (lines 25-27). The number of operations is stored at the 
address of the nsops variable (line 28). 

Next, a loop is entered to initialize the array of structures (lines 30-77). 
The semaphore number, operation, and operation command (flags) are entered 
for each structure in the array. The number of structures equals the number 
of semaphore operations (nsops) to be performed for the system call, so nsops 
is tested against the i counter for loop control. Note that sops is used as a 
pointer to each element (structure) in the array, and sops is incremented just 
like i. sops is then used to point to each member in the structure for setting 
them. 

After the array is initialized, all of its elements are printed out for feed- 
back (lines 78-85). 

The sops pointer is set to the address of the array (lines 86 and 87). The
array name sembuf could be used directly, if desired, instead of sops in the
system call.

The system call is made (line 89), and depending upon success or failure, 
a corresponding message is displayed. The results of the operation(s) can be 
viewed by using the semctl() GETALL control command. 

The example program for the semop(2) system call follows. It is
suggested that the source program file be named semop.c and that the executable
file be named semop.

  1  /*This is a program to illustrate
  2  **the semaphore operations, semop(),
  3  **system call capabilities.
  4  */
  5  /*Include necessary header files.*/
  6  #include <stdio.h>
  7  #include <sys/types.h>
  8  #include <sys/ipc.h>
  9  #include <sys/sem.h>
 10  /*Start of main C language program*/
 11  main()
 12  {
 13      extern int errno;
 14      struct sembuf sembuf[10], *sops;
 15      char string[32];  /*Input buffer; the size is illustrative.*/
 16      int retrn, flags, sem_num, i, semid;
 17      unsigned nsops;
 18      sops = sembuf;  /*Pointer to array sembuf.*/
 19      /*Enter the semaphore ID.*/
 20      printf("\nEnter the semid of\n");
 21      printf("the semaphore set to\n");
 22      printf("be operated on = ");
 23      scanf("%d", &semid);
 24      printf("\nsemid = %d", semid);
 25      /*Enter the number of operations.*/
 26      printf("\nEnter the number of semaphore\n");
 27      printf("operations for this set = ");
 28      scanf("%d", &nsops);
 29      printf("\nnsops = %d", nsops);
 30      /*Initialize the array for the
 31        number of operations to be performed.*/
 32      for(i = 0; i < nsops; i++, sops++)
 33      {
 34          /*This determines the semaphore in
 35            the semaphore set.*/
 36          printf("\nEnter the semaphore\n");
 37          printf("number (sem_num) = ");
 38          scanf("%d", &sem_num);
 39          sops->sem_num = sem_num;
 40          printf("\nThe sem_num = %d", sops->sem_num);
 41          /*Enter a (-)number to decrement,
 42            an unsigned number (no +) to increment,
 43            or zero to test for zero. These values
 44            are entered into a string and converted
 45            to integer values.*/
 46          printf("\nEnter the operation for\n");
 47          printf("the semaphore (sem_op) = ");
 48          scanf("%31s", string);
 49          sops->sem_op = atoi(string);
 50          printf("\nsem_op = %d\n", sops->sem_op);
 51          /*Specify the desired flags.*/
 52          printf("\nEnter the corresponding\n");
 53          printf("number for the desired\n");
 54          printf("flags:\n");
 55          printf("No flags                  = 0\n");
 56          printf("IPC_NOWAIT                = 1\n");
 57          printf("SEM_UNDO                  = 2\n");
 58          printf("IPC_NOWAIT and SEM_UNDO   = 3\n");
 59          printf("            Flags         = ");
 60          scanf("%d", &flags);
 61          switch(flags)
 62          {
 63          case 0:
 64              sops->sem_flg = 0;
 65              break;
 66          case 1:
 67              sops->sem_flg = IPC_NOWAIT;
 68              break;
 69          case 2:
 70              sops->sem_flg = SEM_UNDO;
 71              break;
 72          case 3:
 73              sops->sem_flg = IPC_NOWAIT | SEM_UNDO;
 74              break;
 75          }
 76          printf("\nFlags = 0%o\n", sops->sem_flg);
 77      }
 78      /*Print out each structure in the array.*/
 79      for(i = 0; i < nsops; i++)
 80      {
 81          printf("\nsem_num = %d\n", sembuf[i].sem_num);
 82          printf("sem_op = %d\n", sembuf[i].sem_op);
 83          printf("sem_flg = %o\n", sembuf[i].sem_flg);
 84          printf("%c", ' ');
 85      }
 86      sops = sembuf;  /*Reset the pointer to
 87                        sembuf[0].*/
 88      /*Do the semop system call.*/
 89      retrn = semop(semid, sops, nsops);
 90      if(retrn == -1) {
 91          printf("\nSemop failed. ");
 92          printf("Error = %d\n", errno);
 93      }
 94      else {
 95          printf("\nSemop was successful\n");
 96          printf("for semid = %d\n", semid);
 97          printf("Value returned = %d\n", retrn);
 98      }
 99  }

Figure 9-11: semop(2) System Call Example
The shared memory type of IPC allows two or more processes (executing
programs) to share memory and, consequently, the data contained there. This
is done by allowing processes to set up access to a common virtual memory
address space. This sharing occurs on a segment basis, which depends on the
memory management hardware.

This sharing of memory provides the fastest means of exchanging data 
between processes. 

A process initially creates a shared memory segment facility using the 
shmget(2) system call. Upon creation, this process sets the overall operation 
permissions for the shared memory segment facility, sets its size in bytes, and 
can specify that the shared memory segment is for reference only (read-only) 
upon attachment. If the memory segment is not specified to be for reference 
only, all other processes with appropriate operation permissions can read from 
or write to the memory segment. 

There are two operations that can be performed on a shared memory seg- 
ment: 

■ shmat(2) — shared memory attach 

■ shmdt(2) — shared memory detach 

Shared memory attach allows processes to associate themselves with the 
shared memory segment if they have permission. They can then read or write 
as allowed. 

Shared memory detach allows processes to disassociate themselves from a 
shared memory segment. Therefore, they lose the ability to read from or 
write to the shared memory segment. 
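The create, attach, and detach sequence just described can be sketched end to end. This is a minimal illustration under assumptions: the function name shm_round_trip is hypothetical, the segment is created with the private key so that no key has to be chosen, and the second attach uses the SHM_RDONLY flag to show a reference-only attachment.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Copy msg into a new private segment of n bytes, detach, attach
   again read-only, and copy the contents back into out[n].
   Returns 0 on success, -1 on failure. The segment is removed
   before returning. */
int shm_round_trip(const char *msg, char *out, size_t n)
{
    int shmid;
    char *p;
    int ok = -1;

    if (n == 0)
        return -1;
    shmid = shmget(IPC_PRIVATE, n, IPC_CREAT | 0600);
    if (shmid == -1)
        return -1;
    p = (char *)shmat(shmid, NULL, 0);          /* read/write attach */
    if (p != (char *)-1) {
        strncpy(p, msg, n - 1);
        p[n - 1] = '\0';
        shmdt(p);                                /* detach */
        p = (char *)shmat(shmid, NULL, SHM_RDONLY);
        if (p != (char *)-1) {
            memcpy(out, p, n);                   /* data is still there */
            shmdt(p);
            ok = 0;
        }
    }
    shmctl(shmid, IPC_RMID, NULL);               /* remove the facility */
    return ok;
}
```

The data written through the first attachment is still present after the detach, because the segment exists independently of any one attachment until it is removed.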

The original owner/creator of a shared memory segment can relinquish
ownership to another process using the shmctl(2) system call. However, the
creating process remains the creator until the facility is removed, or the system
is reinitialized. Other processes with permission can perform other functions
on the shared memory segment using the shmctl(2) system call.

System calls, which are documented in the Programmer's Reference Manual,
make these shared memory capabilities available to processes. The calling
process passes arguments to a system call, and the system call either
successfully or unsuccessfully performs its function. If the system call is
successful, it performs its function and returns the appropriate information.
Otherwise, a known error code (-1) is returned to the process, and the external
variable errno is set accordingly.



Using Shared Memory 

The sharing of memory between processes occurs on a virtual segment 
basis. There is one and only one instance of an individual shared memory 
segment existing in the UNIX Operating System at any point in time. 

Before sharing of memory can be realized, a uniquely identified shared 
memory segment and data structure must be created. The unique identifier 
created is called the shared memory identifier (shmid); it is used to identify or 
reference the associated data structure. The data structure includes the follow- 
ing for each shared memory segment: 

■ operation permissions 

■ segment size 

■ segment descriptor

■ process identification performing last operation 

■ process identification of creator 

■ current number of processes attached 

■ in memory number of processes attached 

■ last attach time 

■ last detach time 

■ last change time 

The C Programming Language data structure definition for the shared
memory segment data structure is located in the /usr/include/sys/shm.h
header file. It is as follows:

/*
**  There is a shared mem id data structure for
**  each segment in the system.
*/

struct shmid_ds {
        struct ipc_perm  shm_perm;     /* operation permission struct */
        int              shm_segsz;    /* segment size */
        struct region    *shm_reg;     /* ptr to region structure */
        char             pad[4];       /* for swap compatibility */
        ushort           shm_lpid;     /* pid of last shmop */
        ushort           shm_cpid;     /* pid of creator */
        ushort           shm_nattch;   /* used only for shminfo */
        ushort           shm_cnattch;  /* used only for shminfo */
        time_t           shm_atime;    /* last shmat time */
        time_t           shm_dtime;    /* last shmdt time */
        time_t           shm_ctime;    /* last change time */
};




Note that the shm_perm member of this structure uses ipc_perm as a
template. The breakout for the operation permissions data structure is shown
in Figure 9-1.

The ipc_perm data structure is the same for all IPC facilities, and it is
located in the <sys/ipc.h> header file. It is shown in the introduction
section of "Messages."

Figure 9-12 is a table that shows the shared memory state information.

                      Shared Memory States

    Lock Bit    Swap Bit    Allocated Bit    Implied State
       0           0              0          Unallocated Segment
       0           0              1          Incore
       0           1              0          Unused
       0           1              1          On Disk
       1           0              1          Locked Incore
       1           1              0          Unused
       1           0              0          Unused
       1           1              1          Unused

Figure 9-12: Shared Memory State Information



The implied states of Figure 9-12 are as follows: 

■ Unallocated Segment — the segment associated with this segment 
descriptor has not been allocated for use. 

■ Incore — the shared segment associated with this descriptor has been 
allocated for use. Therefore, the segment does exist and is currently 
resident in memory. 

■ On Disk — the shared segment associated with this segment descriptor 
is currently resident on the swap device. 

■ Locked Incore — the shared segment associated with this segment 
descriptor is currently locked in memory and will not be a candidate for 
swapping until the segment is unlocked. Only the super-user may lock 
and unlock a shared segment. 

■ Unused — this state is currently unused and should never be
encountered by the normal user in shared memory handling.

The shmget(2) system call is used to perform two tasks when only the
IPC_CREAT flag is set in the shmflg argument that it receives:

■ to get a new shmid and create an associated shared memory segment 
data structure for it 

■ to return an existing shmid that already has an associated shared 
memory segment data structure 

The task performed is determined by the value of the key argument 
passed to the shmget(2) system call. For the first task, if the key is not 
already in use for an existing shmid, a new shmid is returned with an associ- 
ated shared memory segment data structure created for it, provided no system 
tunable parameters would be exceeded. 

There is also a provision for specifying a key of value zero which is
known as the private key (IPC_PRIVATE = 0); when specified, a new shmid
is always returned with an associated shared memory segment data structure
created for it, unless a system tunable parameter would be exceeded. When
the ipcs command is performed, the KEY field for the shmid is all zeros.

For the second task, if a shmid exists for the key specified, the value of
the existing shmid is returned. If it is not desired to have an existing shmid
returned, a control command (IPC_EXCL) can be specified (set) in the shmflg
argument passed to the system call. The details for using this system call are
discussed in the "Using shmget" section of this chapter.

When performing the first task, the process that calls shmget becomes the 
owner/creator, and the associated data structure is initialized accordingly. 
Remember, ownership can be changed, but the creating process always 
remains the creator; see the "Controlling Shared Memory" section in this 
chapter. The creator of the shared memory segment also determines the ini- 
tial operation permissions for it. 

Once a uniquely identified shared memory segment data structure is
created, shared memory segment operations [shmop(2)] and control [shmctl(2)]
can be used.

Shared memory segment operations consist of attaching and detaching 
shared memory segments. System calls are provided for each of these opera- 
tions; they are shmat(2) and shmdt(2). Refer to the "Operations for Shared 
Memory" section in this chapter for details of these system calls. 

Shared memory segment control is done by using the shmctl(2) system 
call. It permits you to control the shared memory facility in the following 
ways: 

■ by determining the associated data structure status for a shared 
memory segment (shmid) 

■ by changing operation permissions for a shared memory segment 

■ by removing a particular shmid from the UNIX Operating System 
along with its associated shared memory segment data structure 

■ by locking a shared memory segment in memory 

■ by unlocking a shared memory segment. 

Refer to the "Controlling Shared Memory" section in this chapter for 
details of the shmctl(2) system call. 



Getting Shared Memory Segments 

This section gives a detailed description of using the shmget(2) system 
call along with an example program illustrating its use. 

Using shmget 

The synopsis found in the shmget(2) entry in the Programmer's Reference 
Manual is as follows: 



#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int shmget (key, size, shmflg)
key_t key;
int size, shmflg;

All of these include files are located in the /usr/include/sys directory of
the UNIX Operating System. The following line in the synopsis informs you
that shmget(2) is a function with three formal arguments that returns an
integer type value (shmid) upon successful completion.

int shmget (key, size, shmflg)

The next two lines declare the types of the formal arguments. The type
key_t is declared by a typedef in the types.h header file to be an integer.

key_t key;
int size, shmflg;

The integer returned from this function upon successful completion is the 
shared memory identifier (shmid) that was discussed earlier. 

As declared, the process calling the shmget(2) system call must supply 
three arguments to be passed to the formal key, size, and shmflg arguments. 

A new shmid with an associated shared memory data structure is pro- 
vided if one of the following conditions exists: 

■ key is equal to IPC_PRIVATE

■ key is passed a unique hexadecimal integer, and shmflg ANDed with 
IPC_CREAT is TRUE. 

The value passed to the shmflg argument must be an integer type octal 
value and will specify the following: 

■ access permissions 

■ execution modes 

■ control fields (commands) 

Access permissions determine the read/write attributes, and execution
modes determine the user/group/other attributes of the shmflg argument.
They are collectively referred to as "operation permissions." Figure 9-13
reflects the numeric values (expressed in octal notation) for the valid operation
permissions codes.

    Operation Permissions     Octal Value
    Read by User              00400
    Write by User             00200
    Read by Group             00040
    Write by Group            00020
    Read by Others            00004
    Write by Others           00002

Figure 9-13: Operation Permissions Codes



A specific octal value is derived by adding the octal values for the
operation permissions desired. That is, if read by user and read/write by others is
desired, the code value would be 00406 (00400 plus 00006). There are
constants located in the shm.h header file which can be used for the user
(OWNER). They are as follows:

SHM_R 0400 
SHM_W 0200 

Control commands are predefined constants (represented by all uppercase
letters). Figure 9-14 contains the names of the constants that apply to the
shmget(2) system call along with their values. They are also referred to as
flags and are defined in the ipc.h header file.

    Control Command     Value
    IPC_CREAT           0001000
    IPC_EXCL            0002000

Figure 9-14: Control Commands (Flags)

The value for the shmflg argument is, therefore, a combination of
operation permissions and control commands. After determining the value for the
operation permissions as previously described, the desired flag(s) can be
specified. This specification is accomplished by bitwise ORing (|) them with the
operation permissions; bit positions and values for the control commands in
relation to those of the operation permissions make this possible. An example
of determining the shmflg argument follows.

                         Octal Value     Binary Value
    IPC_CREAT        =   01000           000 001 000 000 000
    | ORed by User   =   00400           000 000 100 000 000
    shmflg           =   01400           000 001 100 000 000

The shmflg value can easily be set by using the names of the flags in con- 
junction with the octal operation permissions value: 

shmid = shmget (key, size, (IPC_CREAT | 0400)); 

shmid = shmget (key, size, (IPC_CREAT | IPC_EXCL | 0400));
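
When the choice of flags is made at run time, the same ORing can be done
step by step. The following is a small sketch under that assumption;
make_shmflg is a hypothetical helper name, not a library function.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Build a shmflg value from an octal operation permissions code
 * and two yes/no choices for the control commands. */
int make_shmflg(int perms, int want_creat, int want_excl)
{
    int shmflg = perms;

    if (want_creat)
        shmflg |= IPC_CREAT;    /* create the segment if the key is unused */
    if (want_excl)
        shmflg |= IPC_EXCL;     /* fail if the key is already in use */
    return shmflg;
}
```

For example, make_shmflg(0400, 1, 0) yields the same value as the expression
(IPC_CREAT | 0400) used in the first call above.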

As specified by the shmget(2) entry in the Programmer's Reference Manual, 
success or failure of this system call depends upon the argument values for 
key, size, and shmflg or system tunable parameters. The system call will 
attempt to return a new shmid if one of the following conditions exists: 

■ key is equal to IPC_PRIVATE (0).

■ key does not already have a shmid associated with it, and (shmflg &
IPC_CREAT) is TRUE (not zero).

The key argument can be set to IPC_PRIVATE in the following ways:

shmid = shmget (IPC_PRIVATE, size, shmflg);

or

shmid = shmget (0, size, shmflg);

This alone will cause the system call to be attempted because it satisfies the 
first condition specified. Exceeding the SHMMNI system tunable parameter 
always causes a failure. The SHMMNI system tunable parameter determines 
the maximum number of unique shared memory segments (shmids) in the 
UNIX Operating System. 
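
A minimal sketch of the first condition follows. The function names are
hypothetical, and the permissions value 0600 (read and write by user) is
chosen only for illustration. Because every segment counts against SHMMNI
until it is removed, the sketch also shows removal with the shmctl(2) system
call, which is described later in this chapter.

```c
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private segment of the given size, readable and
 * writable by its owner. Returns the shmid, or -1 on failure. */
int create_private_segment(int size)
{
    int shmid = shmget(IPC_PRIVATE, size, 0600);

    if (shmid == -1)
        printf("shmget failed, errno = %d\n", errno);
    return shmid;
}

/* Remove a segment so that it no longer counts against SHMMNI. */
int remove_segment(int shmid)
{
    return shmctl(shmid, IPC_RMID, (struct shmid_ds *)0);
}
```

Two successive calls to create_private_segment() return two different shmids,
since IPC_PRIVATE always requests a new segment.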

The second condition is satisfied if the value for key is not already associ- 
ated with a shmid and the bitwise ANDing of shmflg and IPC_CREAT is 
TRUE (not zero). This means that the key is unique (not in use) within the 
UNIX Operating System for this facility type and that the IPC_CREAT flag is 
set (shmflg | IPC_CREAT). The bitwise ANDing (&), which is the logical way
of testing if a flag is set, is illustrated as follows: 



  shmflg       =  x1xxx    (x = immaterial)
& IPC_CREAT    =  01000

  result       =  01000    (not zero)

Because the result is not zero, the flag is set or TRUE. SHMMNI applies here 
also, just as for condition one. 
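
The test illustrated above can be written as a one-line C function. This is
only a sketch; flag_is_set is a hypothetical name.

```c
#include <sys/types.h>
#include <sys/ipc.h>

/* Return 1 if flag is set in shmflg, and 0 if it is not.
 * The AND keeps only the bit(s) of the flag; a nonzero
 * result means the flag is set (TRUE). */
int flag_is_set(int shmflg, int flag)
{
    return (shmflg & flag) != 0;
}
```

For instance, flag_is_set(IPC_CREAT | 0400, IPC_CREAT) is 1, while
flag_is_set(0400, IPC_CREAT) is 0.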

IPC_EXCL is another control command. Used in conjunction with
IPC_CREAT, it causes the system call to fail if, and only if, a shmid
already exists for the specified key. This is necessary to prevent the process
from thinking that it has received a new (unique) shmid when it has not. In
other words, when both IPC_CREAT and IPC_EXCL are specified, a new, unique
shmid is returned if the system call is successful. Any value for shmflg
returns a new shmid if the key equals zero (IPC_PRIVATE).
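
The following sketch shows one way a process might use the two flags
together. The function name create_exclusive and the permissions value 0600
are assumptions made for this illustration. When the key is already in use,
the call fails and errno is set to EEXIST.

```c
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Try to create a brand-new segment for key. Returns the shmid
 * on success. Returns -1 on failure; errno is EEXIST when a
 * shmid already exists for the key. */
int create_exclusive(key_t key, int size)
{
    int shmid = shmget(key, size, IPC_CREAT | IPC_EXCL | 0600);

    if (shmid == -1 && errno == EEXIST) {
        int saved = errno;      /* printf may change errno */

        printf("key 0x%lx already has a shmid\n", (unsigned long)key);
        errno = saved;
    }
    return shmid;
}
```

A second call with the same key fails for as long as the segment created by
the first call exists.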

The system call will fail if the value for the size argument is less than 
SHMMIN or greater than SHMMAX. These tunable parameters specify the 
minimum and maximum shared memory segment sizes. 
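
The effect of the size limits can be observed with a short probe such as the
one sketched below. The function name is hypothetical, and the sketch assumes
that the system minimum is at least one byte, so that a size of zero is
rejected with the EINVAL error.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Try to create a private segment of the given size and remove
 * it again at once. Returns 0 if the size was accepted, or the
 * errno value if shmget failed. */
int try_segment_size(size_t size)
{
    int shmid = shmget(IPC_PRIVATE, size, 0600);

    if (shmid == -1)
        return errno;
    shmctl(shmid, IPC_RMID, (struct shmid_ds *)0);
    return 0;
}
```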

Refer to the shmget(2) manual page for specific, associated data structure 
initialization for successful completion. The specific failure conditions with 
error names are contained there also. 

Example Program 

The example program in this section (Figure 9-15) is a menu-driven pro- 
gram which allows all possible combinations of using the shmget(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 4-7) by including the required header files as 
specified by the shmget(2) entry in the Programmer's Reference Manual. Note
that the errno.h header file is included as opposed to declaring errno as an 
external variable; either method will work. 

Variable names have been chosen to be as close as possible to those in the
synopsis for the system call. Their declarations are self-explanatory. These
names make the program more readable, and are perfectly legal since they are
local to the program. Variables declared for this program and their purposes
are as follows:

■ key — is used to pass the value for the desired key.

■ opperm — is used to store the desired operation permissions.

■ flags — is used to store the desired control commands (flags).

■ opperm_flags — is used to store the combination from the logical
ORing of the opperm and flags variables; it is then used in the system
call to pass the shmflg argument.

■ shmid — is used for returning the shared memory identifier
for a successful system call or the error code (-1) for an unsuccessful
one.

■ size — is used to specify the shared memory segment size.

The program begins by prompting for a hexadecimal key, an octal opera- 
tion permissions code, and the control command combinations (flags) which 
are selected from a menu (lines 14-31). All possible combinations are allowed 
even though they might not be viable. This allows observing the errors for 
illegal combinations. 

Next, the menu selection for the flags is combined with the operation per- 
missions, and the result is stored at the address of the opperm_flags variable 
(lines 35-50). 

A display then prompts for the size of the shared memory segment, and it 
is stored at the address of the size variable (lines 51-54). 

The system call is made next, and the result is stored at the address of the 
shmid variable (line 56). 

Since the shmid variable now contains a valid shared memory identifier or
the error code (-1), it is tested to see if an error occurred (line 58). If shmid 
equals -1, a message indicates that an error resulted and the external errno 
variable is displayed (lines 60 and 61). 

If no error occurred, the returned shared memory segment identifier is 
displayed (line 65). 

The example program for the shmget(2) system call follows. It is
suggested that the source program file be named shmget.c and that the
executable file be named shmget.

When compiling C programs that use floating point operations, the -f
option should be used on the cc command line. If this option is not used, the
program will compile successfully, but when the program is executed, it will
fail.



  1  /*This is a program to illustrate
  2  **the shared memory get, shmget(),
  3  **system call capabilities.*/

  4  #include  <sys/types.h>
  5  #include  <sys/ipc.h>
  6  #include  <sys/shm.h>
  7  #include  <errno.h>

  8  /*Start of main C language program*/
  9  main()
 10  {
 11      key_t key;         /*declare as long integer*/
 12      int opperm, flags;
 13      int shmid, size, opperm_flags;

 14      /*Enter the desired key*/
 15      printf("Enter the desired key in hex = ");
 16      scanf("%x", &key);

 17      /*Enter the desired octal operation
 18        permissions.*/
 19      printf("\nEnter the operation\n");
 20      printf("permissions in octal = ");
 21      scanf("%o", &opperm);

 22      /*Set the desired flags.*/
 23      printf("\nEnter corresponding number to\n");
 24      printf("set the desired flags:\n");
 25      printf("No flags                  = 0\n");
 26      printf("IPC_CREAT                 = 1\n");
 27      printf("IPC_EXCL                  = 2\n");
 28      printf("IPC_CREAT and IPC_EXCL    = 3\n");
 29      printf("            Flags         = ");
 30      /*Get the flag(s) to be set.*/
 31      scanf("%d", &flags);

 32      /*Check the values.*/
 33      printf("\nkey =0x%x, opperm = 0%o, flags = 0%o\n",
 34          key, opperm, flags);

 35      /*Incorporate the control fields (flags) with
 36        the operation permissions*/
 37      switch (flags)
 38      {
 39      case 0:    /*No flags are to be set.*/
 40          opperm_flags = (opperm | 0);
 41          break;
 42      case 1:    /*Set the IPC_CREAT flag.*/
 43          opperm_flags = (opperm | IPC_CREAT);
 44          break;
 45      case 2:    /*Set the IPC_EXCL flag.*/
 46          opperm_flags = (opperm | IPC_EXCL);
 47          break;
 48      case 3:    /*Set the IPC_CREAT and IPC_EXCL flags.*/
 49          opperm_flags = (opperm | IPC_CREAT | IPC_EXCL);
 50      }

 51      /*Get the size of the segment in bytes.*/
 52      printf("\nEnter the segment");
 53      printf("\nsize in bytes = ");
 54      scanf("%d", &size);

 55      /*Call the shmget system call.*/
 56      shmid = shmget(key, size, opperm_flags);

 57      /*Perform the following if the call is unsuccessful.*/
 58      if(shmid == -1)
 59      {
 60          printf("\nThe shmget system call failed!\n");
 61          printf("The error number = %d\n", errno);
 62      }
 63      /*Return the shmid upon successful completion.*/
 64      else
 65          printf("\nThe shmid = %d\n", shmid);
 66      exit(0);
 67  }

Figure 9-15: shmget(2) System Call Example



Controlling Shared Memory 

This section gives a detailed description of using the shmctl(2) system call
along with an example program which allows all of its capabilities to be
exercised.

Using shmctl

The synopsis found in the shmctl(2) entry in the Programmer's Reference
Manual is as follows:

#include  <sys/types.h>
#include  <sys/ipc.h>
#include  <sys/shm.h>

int shmctl (shmid, cmd, buf)
int shmid, cmd;
struct shmid_ds *buf;



The shmctl(2) system call requires three arguments to be passed to it, and it 
returns an integer value. Upon successful completion, a zero value is 
returned. When unsuccessful, a -1 is returned. 

The shmid variable must be a valid, non-negative, integer value. In other 
words, it must have already been created by using the shmget(2) system call. 

The cmd argument can be replaced by one of the following control
commands (flags):

■ IPC_STAT — returns the status information contained in the associated
data structure for the specified shmid and places it in the data structure
pointed to by the *buf pointer in the user memory area.

■ IPC_SET — for the specified shmid, sets the effective user and group
identification, and operation permissions.

■ IPC_RMID — removes the specified shmid along with its associated
shared memory segment data structure.

■ SHM_LOCK — locks the specified shared memory segment in memory;
must be super-user.

■ SHM_UNLOCK — unlocks the shared memory segment from memory;
must be super-user.

A process must have an effective user identification of
OWNER/CREATOR or super-user to perform an IPC_SET or IPC_RMID
control command. Only the super-user can perform a SHM_LOCK or
SHM_UNLOCK control command. A process must have read permission to
perform the IPC_STAT control command.

The details of this system call are discussed in the example program for it. 
If you have problems understanding the logic manipulations in this program, 
read the "Using shmget" section of this chapter; it goes into more detail than 
would be practical to do for every system call. 
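
Before turning to the full example program, the IPC_STAT control command
alone can be sketched as follows. The function names get_status and
print_status are hypothetical, and only a few members of the data structure
are displayed.

```c
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Copy the status of shmid into the structure pointed to by buf.
 * Returns 0 on success and -1 on failure. */
int get_status(int shmid, struct shmid_ds *buf)
{
    int rtrn = shmctl(shmid, IPC_STAT, buf);

    if (rtrn == -1)
        printf("shmctl IPC_STAT failed, errno = %d\n", errno);
    return rtrn;
}

/* Display some members of a structure filled in by get_status. */
void print_status(const struct shmid_ds *buf)
{
    printf("owner uid   = %ld\n", (long)buf->shm_perm.uid);
    printf("owner gid   = %ld\n", (long)buf->shm_perm.gid);
    printf("permissions = 0%o\n", (unsigned)(buf->shm_perm.mode & 0777));
    printf("size        = %lu\n", (unsigned long)buf->shm_segsz);
    printf("# attached  = %lu\n", (unsigned long)buf->shm_nattch);
}
```

The caller supplies the structure, so print_status should be used only after
get_status has returned zero.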

Example Program 

The example program in this section (Figure 9-16) is a menu-driven pro- 
gram which allows all possible combinations of using the shmctl(2) system 
call to be exercised. 

From studying this program, you can observe the method of passing
arguments and receiving return values. The user-written program requirements
are pointed out.

This program begins (lines 5-9) by including the required header files as
specified by the shmctl(2) entry in the Programmer's Reference Manual. Note
in this program that errno is declared as an external variable, and therefore,
the errno.h header file does not have to be included.

Variable and structure names have been chosen to be as close as possible 
to those in the synopsis for the system call. Their declarations are self- 
explanatory. These names make the program more readable, and are perfectly 
legal, since they are local to the program. Variables declared for this program 
and their purposes are as follows: 

■ uid — is used to store the IPC_SET value for the effective user
identification.

■ gid — is used to store the IPC_SET value for the effective group
identification.

■ mode — is used to store the IPC_SET value for the operation
permissions.

■ rtrn — is used to store the return integer value from the system call.

■ shmid — is used to store and pass the shared memory segment
identifier to the system call.

■ command — is used to store the code for the desired control command
so that subsequent processing can be performed on it.

■ choice — is used to determine which member for the IPC_SET control
command is to be changed.

■ shmid_ds — is used to receive the specified shared memory segment
identifier's data structure when an IPC_STAT control command is
performed.

■ *buf — is a pointer passed to the system call which locates the data
structure in the user memory area where the IPC_STAT control
command is to place its return values, or where the IPC_SET command
gets the values to set.

Note that the shmid_ds data structure in this program (line 16) uses the
data structure located in the shm.h header file of the same name as a template
for its declaration. This is an example of the advantage of local variables.

The next important thing to observe is that although the *buf pointer is
declared to be a pointer to a data structure of the shmid_ds type, it must also
be initialized to contain the address of the user memory area data structure
(line 17).

Now that all of the required declarations have been explained for this pro- 
gram, this is how it works. 

First, the program prompts for a valid shared memory segment identifier 
which is stored at the address of the shmid variable (lines 18-20). This is 
required for every shmctl(2) system call. 

Then, the code for the desired control command must be entered (lines 
21-29), and it is stored at the address of the command variable. The code is 
tested to determine the control command for subsequent processing. 

If the IPC_STAT control command is selected (code 1), the system call is
performed (lines 39 and 40), and the status information returned is printed out
(lines 41-71). Note that if the system call is unsuccessful (line 146), the status
information of the last successful call is printed out. In addition, an error
message is displayed and the errno variable is printed out (lines 148 and 149). If
the system call is successful, a message indicates this along with the shared
memory segment identifier used (lines 151-154).

If the IPC_SET control command is selected (code 2), the first thing to do
is get the current status information for the shared memory identifier specified
(lines 90-92). This is necessary because this example program provides for
changing only one member at a time, and the system call changes all of them.
Also, if an invalid value happened to be stored in the user memory area for
one of these members, it would cause repetitive failures for this control
command until corrected. The next thing the program does is to prompt for a
code corresponding to the member to be changed (lines 93-98). This code is
stored at the address of the choice variable (line 99). Now, depending upon
the member picked, the program prompts for the new value (lines 105-127).
The value is placed at the address of the appropriate member in the user
memory area data structure, and the system call is made (lines 128-130).
Depending upon success or failure, the program returns the same messages as
for IPC_STAT above.
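
The reason for performing IPC_STAT before IPC_SET can be seen in a short
sketch that changes only the operation permissions. The function name
set_mode is hypothetical. The owner and group members are left exactly as
IPC_STAT returned them, so the IPC_SET that follows writes back valid values
for the members that are not being changed.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Change only the operation permissions of shmid.
 * Returns 0 on success and -1 on failure. */
int set_mode(int shmid, int mode)
{
    struct shmid_ds ds;

    /* Get the current values first so that uid and gid are valid. */
    if (shmctl(shmid, IPC_STAT, &ds) == -1)
        return -1;

    /* Replace the nine permission bits and keep everything else. */
    ds.shm_perm.mode = (ds.shm_perm.mode & ~0777) | (mode & 0777);

    /* IPC_SET writes uid, gid, and mode back together. */
    return shmctl(shmid, IPC_SET, &ds);
}
```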

If the IPC_RMID control command (code 3) is selected, the system call is
performed (lines 132-135), and the shmid along with its associated shared
memory segment and data structure are removed from the UNIX Operating System.
Note that the *buf pointer is not required as an argument to perform this
control command and its value can be zero or NULL. Depending upon success or
failure, the program returns the same messages as for the other control
commands.

If the SHM_LOCK control command (code 4) is selected, the system call 
is performed (lines 137 and 138). Depending upon success or failure, the pro- 
gram returns the same messages as for the other control commands. 

If the SHM_UNLOCK control command (code 5) is selected, the system
call is performed (lines 140-142). Depending upon success or failure, the pro- 
gram returns the same messages as for the other control commands. 

The example program for the shmctl(2) system call follows. It is
suggested that the source program file be named shmctl.c and that the executable
file be named shmctl.

When compiling C programs that use floating point operations, the -f
option should be used on the cc command line. If this option is not used, the
program will compile successfully, but when the program is executed it will
fail. The -f option is not required, however, on your computer.

  1  /*This is a program to illustrate
  2  **the shared memory control, shmctl(),
  3  **system call capabilities.
  4  */

  5  /*Include necessary header files.*/
  6  #include  <stdio.h>
  7  #include  <sys/types.h>
  8  #include  <sys/ipc.h>
  9  #include  <sys/shm.h>

 10  /*Start of main C language program*/
 11  main()
 12  {
 13      extern int errno;
 14      int uid, gid, mode;
 15      int rtrn, shmid, command, choice;
 16      struct shmid_ds shmid_ds, *buf;
 17      buf = &shmid_ds;

 18      /*Get the shmid, and command.*/
 19      printf("Enter the shmid = ");
 20      scanf("%d", &shmid);
 21      printf("\nEnter the number for\n");
 22      printf("the desired command:\n");
 23      printf("IPC_STAT    =  1\n");
 24      printf("IPC_SET     =  2\n");
 25      printf("IPC_RMID    =  3\n");
 26      printf("SHM_LOCK    =  4\n");
 27      printf("SHM_UNLOCK  =  5\n");
 28      printf("Entry       =  ");
 29      scanf("%d", &command);

 30      /*Check the values.*/
 31      printf("\nshmid =%d, command = %d\n",
 32          shmid, command);

 33      switch (command)
 34      {
 35      case 1:    /*Use shmctl() to duplicate
 36                   the data structure for
 37                   shmid in the shmid_ds area pointed
 38                   to by buf and then print it out.*/
 39          rtrn = shmctl(shmid, IPC_STAT,
 40              buf);
 41          printf("\nThe USER ID = %d\n",
 42              buf->shm_perm.uid);
 43          printf("The GROUP ID = %d\n",
 44              buf->shm_perm.gid);
 45          printf("The creator's ID = %d\n",
 46              buf->shm_perm.cuid);
 47          printf("The creator's group ID = %d\n",
 48              buf->shm_perm.cgid);
 49          printf("The operation permissions = 0%o\n",
 50              buf->shm_perm.mode);
 51          printf("The slot usage sequence\n");
 52          printf("number = 0x%x\n",
 53              buf->shm_perm.seq);
 54          printf("The key= 0x%x\n",
 55              buf->shm_perm.key);
 56          printf("The segment size = %d\n",
 57              buf->shm_segsz);
 58          printf("The pid of last shmop = %d\n",
 59              buf->shm_lpid);
 60          printf("The pid of creator = %d\n",
 61              buf->shm_cpid);
 62          printf("The current # attached = %d\n",
 63              buf->shm_nattch);
 64          printf("The in memory # attached = %d\n",
 65              buf->shm_cnattch);
 66          printf("The last shmat time = %d\n",
 67              buf->shm_atime);
 68          printf("The last shmdt time = %d\n",
 69              buf->shm_dtime);
 70          printf("The last change time = %d\n",
 71              buf->shm_ctime);
 72          break;

             /* Lines 73 - 87 deleted */

 88      case 2:    /*Select and change the desired
 89                   member(s) of the data structure.*/

 90          /*Get the original data for this shmid
 91            data structure first.*/
 92          rtrn = shmctl(shmid, IPC_STAT, buf);

 93          printf("\nEnter the number for the\n");
 94          printf("member to be changed:\n");
 95          printf("shm_perm.uid   = 1\n");
 96          printf("shm_perm.gid   = 2\n");
 97          printf("shm_perm.mode  = 3\n");
 98          printf("Entry          = ");
 99          scanf("%d", &choice);
100          /*Only one choice is allowed per
101            pass as an illegal entry will
102            cause repetitive failures until
103            shmid_ds is updated with
104            IPC_STAT.*/

105          switch(choice){
106          case 1:
107              printf("\nEnter USER ID = ");
108              scanf("%d", &uid);
109              buf->shm_perm.uid = uid;
110              printf("\nUSER ID = %d\n",
111                  buf->shm_perm.uid);
112              break;
113          case 2:
114              printf("\nEnter GROUP ID = ");
115              scanf("%d", &gid);
116              buf->shm_perm.gid = gid;
117              printf("\nGROUP ID = %d\n",
118                  buf->shm_perm.gid);
119              break;
120          case 3:
121              printf("\nEnter MODE = ");
122              scanf("%o", &mode);
123              buf->shm_perm.mode = mode;
124              printf("\nMODE = 0%o\n",
125                  buf->shm_perm.mode);
126              break;
127          }
128          /*Do the change.*/
129          rtrn = shmctl(shmid, IPC_SET,
130              buf);
131          break;

132      case 3:    /*Remove the shmid along with its
133                   associated
134                   data structure.*/
135          rtrn = shmctl(shmid, IPC_RMID, NULL);
136          break;

137      case 4:    /*Lock the shared memory segment*/
138          rtrn = shmctl(shmid, SHM_LOCK, NULL);
139          break;
140      case 5:    /*Unlock the shared memory
141                   segment.*/
142          rtrn = shmctl(shmid, SHM_UNLOCK, NULL);
143          break;
144      }
145      /*Perform the following if the call is unsuccessful.*/
146      if(rtrn == -1)
147      {
148          printf("\nThe shmctl system call failed!\n");
149          printf("The error number = %d\n", errno);
150      }
151      /*Return the shmid upon successful completion.*/
152      else
153          printf("\nShmctl was successful for shmid = %d\n",
154              shmid);
155      exit(0);
156  }

Figure 9-16: shmctl(2) System Call Example

Operations for Shared Memory

This section gives a detailed description of using the shmat(2) and 
shmdt(2) system calls. It also provides an example program which allows all 
of their capabilities to be exercised. 

Using shmop 

The synopsis found in the shmop(2) entry in the Programmer's Reference 
Manual is as follows: 




#include  <sys/ipc.h>
#include  <sys/shm.h>

char *shmat (shmid, shmaddr, shmflg)
int shmid;
char *shmaddr;
int shmflg;

int shmdt (shmaddr)
char *shmaddr;




Attaching a Shared Memory Segment 

The shmat(2) system call requires three arguments to be passed to it, and 
it returns a character pointer value. 

The system call can be cast to return an integer value. Upon successful
completion, this value will be the address in core memory where the process
is attached to the shared memory segment. When unsuccessful, the value will
be -1.

The shmid argument must be a valid, non-negative, integer value. In
other words, it must have already been created by using the shmget(2) system
call.
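
A minimal sketch of attaching a segment and testing the returned value
follows. It assumes a system whose header file declares shmat() as returning
a generic pointer, so the failure value is written as (void *)-1 rather than
as an integer; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Attach shmid at an address chosen by the system. Returns the
 * attach address, or NULL if the shmat system call failed. */
char *attach_segment(int shmid)
{
    void *addr = shmat(shmid, (void *)0, 0);

    if (addr == (void *)-1) {
        printf("shmat failed, errno = %d\n", errno);
        return NULL;
    }
    return (char *)addr;
}

/* Attach, copy a short string into the segment, and detach.
 * segsize is the size of the segment in bytes.
 * Returns 0 on success and -1 on failure. */
int store_message(int shmid, const char *msg, size_t segsize)
{
    char *p = attach_segment(shmid);

    if (p == NULL)
        return -1;
    strncpy(p, msg, segsize - 1);
    p[segsize - 1] = '\0';
    return shmdt(p);
}
```

Data stored through one attachment remains in the segment after the detach,
so a later call to attach_segment() with the same shmid sees the same string.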



The shmaddr argument can be zero or user-supplied when passed to the
shmat(2) system call. If it is zero, the UNIX Operating System picks the
address of where the shared memory segment will be attached. If it is
user-supplied, the address must be a valid address that the UNIX Operating
System would pick. The following table illustrates some typical address ranges
for your computer:

80286              80386

0x01F70000         0x80400000
0x02070000         0x80800000
0x020F0000         0x80C00000
0x02170000         0x81000000

Note that these addresses are in chunks of 0x80000 hexadecimal (for the
80286 Computer) and 0x1000 hexadecimal (for the 80386 Computer). It
would be wise to let the operating system pick addresses so as to improve
portability.

The shmflg argument is used to pass the SHM_RND and
SHM_RDONLY flags to the shmat() system call.

Further details are discussed in the example program for shmop(). If you
have problems understanding the logic manipulations in this program, read
the "Using shmget" section of this chapter; it goes into more detail than
would be practical to do for every system call.

Detaching Shared Memory Segments

The shmdt(2) system call requires one argument to be passed to it, and it
returns an integer value. Upon successful completion, zero is returned. When
unsuccessful, a -1 is returned.

Further details of this system call are discussed in the example program.
If you have problems understanding the logic manipulations in this program,
read the "Using shmget" section of this chapter; it goes into more detail than
would be practical to do for every system call.

Example Program

The example program in this section (Figure 9-17) is a menu-driven pro- 
gram which allows all possible combinations of using the shmat(2) and 
shmdt(2) system calls to be exercised. 

From studying this program, you can observe the method of passing argu- 
ments and receiving return values. The user-written program requirements 
are pointed out. 

This program begins (lines 5-9) by including the required header files as 
specified by the shmop(2) entry in the Programmer's Reference Manual. Note 
that in this program errno is declared as an external variable, and therefore
the errno.h header file does not have to be included. 

Variable and structure names have been chosen to be as close as possible 
to those in the synopsis for the system call. Their declarations are self- 
explanatory. These names make the program more readable, and are perfectly 
legal since they are local to the program. Variables declared for this program 
and their purposes are as follows: 

■ flags — is used to store the codes of SHM_RND or SHM_RDONLY for
the shmat(2) system call. 

■ addr — is used to store the address of the shared memory segment for 
the shmat(2) and shmdt(2) system calls. 

■ i — is used as a loop counter for attaching and detaching. 

■ attach — is used to store the desired number of attach operations. 

■ shmid — is used to store and pass the desired shared memory segment 
identifier. 

■ shmflg — is used to pass the value of flags to the shmat(2) system call. 

■ retrn — is used to store the return values from both system calls. 

■ detach — is used to store the desired number of detach operations. 

This example program combines both the shmat(2) and shmdt(2) system
calls. The program prompts for the number of attachments and enters a loop
until they are done for the specified shared memory identifiers. Then, the
program prompts for the number of detachments to be performed and enters a
loop until they are done for the specified shared memory segment addresses.

shmat

The program prompts for the number of attachments to be performed, and
the value is stored at the address of the attach variable (lines 17-21).

A loop is entered using the attach variable and the i counter (lines 23-70)
to perform the specified number of attachments.

In this loop, the program prompts for a shared memory segment identifier
(lines 24-27) and it is stored at the address of the shmid variable (line 28).
Next, the program prompts for the address where the segment is to be
attached (lines 30-34), and it is stored at the address of the addr variable (line
35). Then, the program prompts for the desired flags to be used for the
attachment (lines 37-44), and the code representing the flags is stored at the
address of the flags variable (line 45). The flags variable is tested to
determine the code to be stored for the shmflg variable used to pass them to the
shmat(2) system call (lines 46-57). The system call is made (line 60). If
successful, a message so stating is displayed along with the attach address (lines
66-68). If unsuccessful, a message so stating is displayed and the error code is
displayed (lines 62, 63). The loop then continues until it finishes.
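
The translation from a menu code to a shmflg value, followed by the attach
itself, can be sketched as two small functions. The function names are
hypothetical. The sketch lets the system choose the address, and it returns
zero (no flags) for any entry other than 1, 2, or 3; that default is an
addition made here and is not part of the example program.

```c
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Translate a menu entry into a shmflg value. */
int flags_to_shmflg(int code)
{
    switch (code) {
    case 1:
        return SHM_RND;
    case 2:
        return SHM_RDONLY;
    case 3:
        return SHM_RND | SHM_RDONLY;
    default:
        return 0;               /* no flags */
    }
}

/* Attach shmid at a system-chosen address with the selected
 * flags. Returns the attach address, or (void *)-1 after
 * displaying the error number. */
void *attach_with_flags(int shmid, int code)
{
    void *addr = shmat(shmid, (void *)0, flags_to_shmflg(code));

    if (addr == (void *)-1)
        printf("Shmat failed. Error = %d\n", errno);
    return addr;
}
```

A segment attached with code 2 (SHM_RDONLY) can be read but not written by
the attaching process.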

shmdt 

After the attach loop completes, the program prompts for the number of 
detach operations to be performed (lines 71-75), and the value is stored at the 
address of the detach variable (line 76). 

A loop is entered using the detach variable and the i counter (lines 78-95) 
to perform the specified number of detachments. 

In this loop, the program prompts for the address of the shared memory 
segment to be detached (lines 79-83), and it is stored at the address of the 
addr variable (line 84). Then, the shmdt(2) system call is performed (line 87). 
If successful, a message so stating is displayed along with the address that the 
segment was detached from (lines 92 and 93). If unsuccessful, the error 
number is displayed (line 89). The loop continues until it finishes. 
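
A detach with its error test can be sketched in the same way. The function
name detach_segment is hypothetical. The sketch returns the error number
instead of displaying it, which makes the failure case easy to observe:
detaching an address at which nothing is attached fails with EINVAL.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Detach the segment attached at addr. Returns 0 on success,
 * or the errno value if the shmdt system call failed. */
int detach_segment(const void *addr)
{
    if (shmdt(addr) == -1)
        return errno;
    return 0;
}
```

Calling detach_segment() twice with the same address succeeds the first time
and fails the second time, because the first call has already removed the
attachment.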

The example program for the shmop(2) system calls follows. It is sug- 
gested that the program be put into a source file called shmop.c and then into 
an executable file called shmop. 

When compiling C programs that use floating point operations, the -f 
option should be used on the cc command line. If this option is not used, the 
program will compile successfully, but when the program is executed, it will 
fail. The -f option is not required, however, on your computer. 
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1 /*lhis is a program to illustrate 

2 **the shared memary operations, shnicq>( ) , 

3 **system call capabilities. 

4 •/ 

5 /♦Include necessary header files.*/ 

6 #include <stdio.h>' 

7 #include <sys/types.h> 

8 #iiicluQte <sys/ipc.h> 

9 ^include <sys/shm.h> 

10 /*Start of main C language program*/ 

11 Toadiii ) 

12 { 

13 extern int ermo; 

14 int flags, adtir, i, attach; 

15 int shraid, shmflg, retm, detach; 



16 /*Loop for attachments hy this process,*/ 

17 printf ("Enter the rannber of\n"); 

18 printf ("attachments for thisNn"); 

19 printf ("process (1-4).\n"); 

20 printfC Attachments = "); 

21 scanf("56d", Sattach); 

22 printf ("Nuinber of attadies = %d\n", attach); 



Figure 9-17: shmop() System Call Example (Sheet 1 of 4) 
    for(i = 1; i <= attach; i++) {
        /*Enter the shared memory ID.*/
        printf("\nEnter the shmid of\n");
        printf("the shared memory segment to\n");
        printf("be operated on = ");
        scanf("%d", &shmid);
        printf("\nshmid = %d\n", shmid);

        /*Enter the value for shmaddr.*/
        printf("\nEnter the value for\n");
        printf("the shared memory address\n");
        printf("in hexadecimal:\n");
        printf("            Shmaddr = ");
        scanf("%x", &addr);
        printf("The desired address = 0x%x\n", addr);

        /*Specify the desired flags.*/
        printf("\nEnter the corresponding\n");
        printf("number for the desired\n");
        printf("flags:\n");
        printf("SHM_RND                = 1\n");
        printf("SHM_RDONLY             = 2\n");
        printf("SHM_RND and SHM_RDONLY = 3\n");
        printf("            Flags      = ");
        scanf("%d", &flags);

Figure 9-17: shmop() System Call Example (Sheet 2 of 4) 
        switch(flags)
        {
        case 1:
            shmflg = SHM_RND;
            break;
        case 2:
            shmflg = SHM_RDONLY;
            break;
        case 3:
            shmflg = SHM_RND | SHM_RDONLY;
            break;
        }
        printf("\nFlags = 0%o\n", shmflg);

        /*Do the shmat system call.*/
        retrn = (int)shmat(shmid, addr, shmflg);
        if(retrn == -1) {
            printf("\nShmat failed. ");
            printf("Error = %d\n", errno);
        }
        else {
            printf("\nShmat was successful\n");
            printf("for shmid = %d\n", shmid);
            printf("The address = 0x%x\n", retrn);
        }
    }

    /*Loop for detachments by this process.*/
    printf("Enter the number of\n");
    printf("detachments for this\n");
    printf("process (1-4).\n");
    printf("       Detachments = ");

Figure 9-17: shmop() System Call Example (Sheet 3 of 4)
    scanf("%d", &detach);
    printf("Number of attaches = %d\n", detach);
    for(i = 1; i <= detach; i++) {
        /*Enter the value for shmaddr.*/
        printf("\nEnter the value for\n");
        printf("the shared memory address\n");
        printf("in hexadecimal:\n");
        printf("            Shmaddr = ");
        scanf("%x", &addr);
        printf("The desired address = 0x%x\n", addr);

        /*Do the shmdt system call.*/
        retrn = (int)shmdt(addr);
        if(retrn == -1) {
            printf("Error = %d\n", errno);
        }
        else {
            printf("\nShmdt was successful\n");
            printf("for address = 0%x\n", addr);
        }
    }
}

Figure 9-17: shmop() System Call Example (Sheet 4 of 4) 
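
The example program holds the shared memory address, and the value returned by shmat(), in variables of type int. On a machine whose pointers are wider than an int, that conversion can lose part of the address. The following is a minimal sketch, not part of the original figure, that keeps the address in a pointer and tests for failure by comparing against the pointer value -1. The function name shm_roundtrip and the use of IPC_PRIVATE to obtain a scratch segment are assumptions made only for this illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper, not part of the guide's listing: create a private
 * segment, attach it, copy msg in, verify the copy, then detach and
 * remove the segment.  Returns 0 on success and -1 on any failure. */
int shm_roundtrip(const char *msg)
{
    size_t len = strlen(msg) + 1;
    int ok;
    char *mem;
    int shmid = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);

    if (shmid == -1)
        return -1;

    /* A null shmaddr lets the system choose the attach address. */
    mem = (char *)shmat(shmid, (void *)0, 0);
    if (mem == (char *)-1) {        /* failure is the pointer value -1 */
        shmctl(shmid, IPC_RMID, (struct shmid_ds *)0);
        return -1;
    }

    memcpy(mem, msg, len);
    ok = (strcmp(mem, msg) == 0);

    if (shmdt(mem) == -1)
        ok = 0;
    if (shmctl(shmid, IPC_RMID, (struct shmid_ds *)0) == -1)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as shm_roundtrip("hello") is expected to return 0 on a system where shared memory is configured. Removing the segment with IPC_RMID keeps the sketch from leaving segments behind.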
Ada Named after the Countess of Lovelace, the
nineteenth-century mathematician and computer pioneer, Ada is
a high-level general-purpose programming language 
developed under the sponsorship of the U.S. Depart- 
ment of Defense. Ada was developed to provide con- 
sistency among programs originating in different 
branches of the military. Ada features include pack- 
ages that make data objects visible only to the 
modules that need them, task objects that facilitate 
parallel processing, and an exception-handling 
mechanism that encourages well-structured error pro- 
cessing. 

ANSI standard ANSI is the acronym for the American National 

Standards Institute. ANSI establishes guidelines in 
the computing industry, from the definition of ASCII 
to the determination of overall datacom system perfor- 
mance. ANSI standards have been established for 
both the Ada and FORTRAN programming languages, 
and a standard for C has been proposed. 

a.out file a.out is the default file name used by the link editor
when it outputs a successfully compiled, executable
file. a.out contains object files that are combined to
create a complete working program. Object file format
is described in Chapter 11, "The Common Object
File Format," and in a.out(4) in the Programmer's
Reference Manual.

application program An application program is a working program in a 

system. Such programs are usually unique to one 
type of user's work, although some application pro- 
grams can be used in a variety of business situations. 
An accounting application, for example, may well be 
applicable to many different businesses. 

archive An archive file or archive library is a collection of data
gathered from several files. Each of the files within
an archive is called a member. The command ar(1)
collects data for use as a library.
argument An argument is additional information that is passed
to a command or a function. On a command line, an
argument is a character string or number that follows 
the command name and is separated from it by a 
space. There are two types of command-line argu- 
ments: options and operands. Options are immedi- 
ately preceded by a minus sign (-) and change the 
execution or output of the command. Some options 
can themselves take arguments. Operands are pre- 
ceded by a space and specify files or directories that 
will be operated on by the command. For example, in 
the command 

pr -t -h Heading file 

all elements after the pr are arguments, -t and -h are 
options. Heading is an argument to the -h option, 
and file is an operand. 

For a function, arguments are enclosed within a pair 
of parentheses immediately following the function 
name. The number of arguments can be zero or
more; if two or more are present, they are separated
by commas and the whole list is enclosed by the
parentheses. The formal definition of a function, such
as might be found on a page in Section 3 of the 
Programmer's Reference Manual, describes the number 
and data type of argument(s) expected by the func- 
tion. 

ASCII ASCII is an acronym for American Standard Code for 

Information Interchange, a standard for data represen- 
tation that is followed in the UNIX System. ASCII 
code represents alphanumeric characters as binary 
numbers. The code includes 128 upper- and lower- 
case letters, numerals, and special characters. Each 
alphanumeric and special character has an ASCII code 
(binary) equivalent that is one byte long. 
assembler The assembler is a translating program that accepts
instructions written in the assembly language of the
computer and translates them into the binary
representation of machine instructions. In many
cases, the assembly language instructions map 1 to 1
with the binary machine instructions.

assembly language A programming language that uses the instruction set
that applies to a particular computer.

BASIC BASIC is a high-level conversational programming
language that allows a computer to be used much like
a complex electronic calculating machine. The name
is an acronym for Beginner's All-purpose Symbolic
Instruction Code.

branch table A branch table is an implementation technique for fixing
the addresses of text symbols, without forfeiting
the ability to update code. Instead of being directly
associated with function code, text symbols label jump
instructions that transfer control to the real code.
Branch table addresses do not change, even when one
changes the code of a routine. Jump table is another
name for branch table.

buffer A buffer is a storage space in computer memory
where data are stored temporarily into convenient
units for system operations. Buffers are often used by
programs, such as editors, that access and alter text or
data frequently. When you edit a file, a copy of its
contents is read into a buffer where you make changes
to the text. For the changes to become part of the
permanent file, you must write the buffer contents
back into the permanent file. This replaces the contents
of the file with the contents of the buffer. When
you quit the editor, the contents of the buffer are
flushed.

byte A byte is a unit of storage in the computer. On many
UNIX Systems, a byte is eight bits (binary digits), the
equivalent of one character of text.






Glossary 



byte order Byte order refers to the order in which the bytes of a
multibyte value are stored in computer memory.
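
Byte order matters when binary data are moved between machines. The sketch below shows one way to find out which order the host uses and to reverse the bytes of a 32-bit value. The function names are hypothetical, and the fixed-width type uint32_t from <stdint.h> is assumed to be available.

```c
#include <stdint.h>

/* Returns 1 if the least significant byte is stored first (little-endian,
 * as on the 80386), 0 otherwise. */
int is_little_endian(void)
{
    uint32_t probe = 1;
    return *(unsigned char *)&probe == 1;
}

/* Reverses the four bytes of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}
```

Applying swap32 twice returns the original value, which is a convenient check when converting data in both directions.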

C The C programming language is a general-purpose 

programming language that features economy of 
expression, control flow, data structures, and a variety 
of operators. It can be used to perform both high- 
level and low-level tasks. Although it has been called 
a system programming language, because it is useful 
for writing operating systems, it has been used equally 
effectively to write major numerical, text-processing, 
and database programs. The C programming 
language was designed for and implemented on the 
UNIX System; however, the language is not limited to 
any one operating system or machine. 

C compiler The C compiler converts C programs into assembly 

language programs that are eventually translated into 
object files by the assembler. 

C preprocessor The C preprocessor is a component of the C Compilation
System. In C source code, statements preceded
with a pound sign (#) are directives to the preprocessor.
Command line options of the cc(1) command
may also be used to control the actions of the preprocessor.
The main work of the preprocessor is to perform
file inclusions and macro substitution.
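
The fragment below is a small sketch of the three kinds of preprocessor work mentioned above: file inclusion, macro substitution, and conditional inclusion controlled from the command line. The names BUFSIZE_DEMO, SQUARE, DEBUG_DEMO, TRACE, and demo_area are invented for this illustration.

```c
#include <stdio.h>                 /* file inclusion */

/* Macro substitution: every use of BUFSIZE_DEMO and SQUARE() below is
 * replaced textually before the compiler proper sees the code. */
#define BUFSIZE_DEMO 512
#define SQUARE(x) ((x) * (x))

/* Conditional inclusion: DEBUG_DEMO may be defined in the source or on
 * the command line with cc -DDEBUG_DEMO. */
#ifdef DEBUG_DEMO
#define TRACE(msg) fprintf(stderr, "trace: %s\n", msg)
#else
#define TRACE(msg) ((void)0)
#endif

int demo_area(int side)
{
    TRACE("computing area");
    return SQUARE(side);
}
```

The parentheses in the definition of SQUARE matter because the substitution is textual; without them, SQUARE(1 + 2) would expand to 1 + 2 * 1 + 2.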

CCS CCS is an abbreviation for C Compilation System, 

which is a set of programming language utilities used 
to produce object code from C source code. The 
major components of a C Compilation System are a C 
preprocessor, C compiler, assembler, and link editor. 
The C preprocessor accepts C source code as input, 
performs any preprocessing required, and passes the 
processed code to the C compiler. The C compiler 
produces assembly language code that it passes to the 
assembler. The assembler, in turn, produces object 
code that can be linked to other object files by the 
link editor. The object files produced are in the Com- 
mon Object File Format (COFF). Other components 
of CCS include a symbolic debugger, an
optimizer that makes the code produced as efficient as
possible, productivity tools that are used to read and 
manipulate object files, and libraries that provide run- 
time support, access to system calls, input/output, 
string manipulation, mathematical functions, and 
other code-processing functions. 

COBOL COBOL is an acronym for COmmon Business 

Oriented Language. COBOL is a high-level program- 
ming language designed for business and commercial 
applications. The English-language statements of 
COBOL provide a relatively machine-independent 
method of expressing a business-oriented problem to 
the computer. 

COFF COFF is an acronym for Common Object File Format. 

COFF refers to the format of the output file produced 
on some UNIX Systems by the assembler and the link 
editor. This format is also used by other operating 
systems. The following are some of its key features: 

□ Applications may add system-dependent informa- 
tion to the object file without causing access utili- 
ties to become obsolete. 

□ Space is provided for symbolic information used 
by debuggers and other applications. 

□ Users may make some modifications in the object 
file construction at compile time. 

command A command is the term commonly used to refer to an 

instruction that a user types at a computer terminal 
keyboard. It can be the name of a file that contains 
an executable program or a shell script that can be 
processed or executed by the computer on request. A 
command is composed of a word or string of letters 
and/or special characters that can continue for several 
(terminal) lines, up to 256 characters. A command 
name is sometimes used interchangeably with a pro- 
gram name. 
command line A command line is composed of the command name
followed by any argument(s) required by the command
or optionally included by the user. The manual
page for a command includes a command line 
synopsis in a notation designed to show the correct 
way to type in a command, with or without options 
and arguments. 

compiler A compiler transforms the high-level language instruc- 

tions in a program (the source code) into object code 
or assembly language. Assembly language code may 
then be passed to the assembler for further translation 
into machine instructions. 

core Core is a (mostly archaic) synonym for primary 

memory. 

core file A core file is an image of a terminated process saved
for debugging. A core file is created under the name
"core" in the current directory of the process when
an abnormal event occurs resulting in the process's
termination. A list of these events is found in the
signal(2) manual page in Section 2 of the Programmer's
Reference Manual.

core image Core image is a copy of all the segments of a running 

or terminated program. The copy may exist in main 
storage, in the swap area, or in a core file. 

curses curses(3X) is a library of C routines that are designed 

to handle input, output, and other operations in 
screen management programs. The name curses 
comes from the cursor optimization that the routines 
provide. When a screen management program is run, 
cursor optimization minimizes the amount of time a 
cursor has to move about a screen to update its con- 
tents. The program refers to the terminfo(4) database 
at run time to obtain the information that it needs 
about the screen (terminal) being used. See
terminfo(4) in the Programmer's Reference Manual.
data symbol A data symbol names a variable that may or may not
be initialized. Normally, these variables reside in
read/write memory during execution. See text symbol.

database A database is a bank of information on a particular
subject or subjects. On-line databases are designed so
that by using subject headings, key words, or key
phrases you can search for, analyze, update, and print
out data.

debug Debugging is the process of locating and correcting
errors in computer programs.

default A default is the way a computer will perform a task in
the absence of other instructions.

delimiter A delimiter is an initial character that identifies the
next character or character string as a particular kind
of argument. Delimiters are typically used for option
names on a command line; they identify the associated
word as an option (or as a string of several
options if the options are bundled). In the UNIX System
command syntax, a minus sign (-) is most often
the delimiter for option names, for example, -s or -n,
although some commands also use a plus sign (+).

directory A directory is a type of file used to group and organize
other files or directories. A directory consists of
entries that specify further files (including directories)
and constitutes a node of the file system. A subdirectory
is a directory that is pointed to by a directory one
level above it in the file system organization.

The ls(1) command is used to list the contents of a
directory. When you first log onto the system, you
are in your home directory ($HOME). You can move
to another directory by using the cd(1) command and
you can print the name of the current directory by
using the pwd(1) command. You can also create new
directories with the mkdir(1) command and remove
empty directories with rmdir(1).

directory name A directory name is a string of characters that identifies
a directory. It can be a simple directory name, the
relative path name or the full path name of a directory.

dynamic linking Dynamic linking refers to the ability to resolve sym- 

bolic references at run time. Systems that use 
dynamic linking can execute processes without resolv- 
ing unused references. See static linking. 

environment An environment is a collection of resources used to 

support a function. In the UNIX System, the shell 
environment is composed of variables whose values 
define the way you interact with the system. For 
example, your environment includes your shell 
prompt string, specifics for backspace and erase char- 
acters, and commands for sending output from your 
terminal to the computer. 

environment variable An environment variable is a shell variable such as
$HOME (which stands for your login directory) or
$PATH (which is a list of directories the shell will
search through for executable commands) that is part
of your environment. When you log in, the system
executes programs that create most of the environment
variables that you need for the commands to
work. These variables come from /etc/profile, a file
that defines a general working environment for all
users when they log onto a system. In addition, you
can define and set variables in your personal .profile
file, which you create in your login directory to tailor
your own working environment. You can also temporarily
set variables at the shell level.
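
A C program can read the same variables with the getenv(3C) routine. The following sketch returns a variable's value or a caller-supplied default when the variable is not set. The helper name env_or_default is hypothetical.

```c
#include <stdlib.h>

/* Hypothetical helper: return the value of the environment variable
 * `name`, or `fallback` when the variable is not set. */
const char *env_or_default(const char *name, const char *fallback)
{
    const char *value = getenv(name);
    return value != NULL ? value : fallback;
}
```

For example, env_or_default("HOME", "/") yields your login directory when $HOME is set and "/" otherwise.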
executable file An executable file is a file that can be processed or
executed by the computer without any further translation.
That is, when you type in the file name, the
commands in the file are executed. An object file that
is ready to run (ready to be copied into the address
space of a process to run as the code of that process)
is an executable file. Files containing shell commands
are also executable. A file may be given execute permission
by using the chmod(1) command. In addition
to being ready to run, a file in the UNIX System needs
to have execute permission.

exit Exit is a specific system call that causes the termination
of a process. The exit(2) call will close any open
files and clean up most other information and
memory which was used by the process.

exit status: return code An exit status or return code is a code number
returned to the shell when a command is terminated
that indicates the cause of termination.

exported symbol An exported symbol is a symbol that a shared library
defines and makes available outside the library. See
imported symbol.

expression An expression is a mathematical or logical symbol or
meaningful combination of symbols. See regular
expression.

file A file is an identifiable collection of information that,
in the UNIX System, is a member of a file system. A
file is known to the UNIX System as an inode plus
the information the inode contains that tells whether
the file is a plain file, a special file, or a directory. A
plain file may contain text, data, programs, or other
information that forms a coherent unit. A special file
is a hardware device or portion thereof, such as a disk
partition. A directory is a type of file that contains the
names and inode addresses of other plain, special, or
directory files.
file and record locking

The phrase "file and record locking" refers to
software that protects records in a data file against the
possibility of being changed by two users at the same 
time. Records (or the entire file) may be locked by 
one authorized user while changes are made. Other 
users are thus prevented from working with the same 
record until the changes are completed. 

file descriptor A file descriptor is a number assigned by the operating
system to a file when the file is opened by a process.
File descriptors 0, 1, and 2 are reserved; 0 is
reserved for standard input (stdin), 1 is reserved for
standard output (stdout), and 2 is reserved for standard
error output (stderr).
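
The following sketch shows file descriptors being used directly with the write(2) and fcntl(2) system calls. The helper names write_string and fd_is_open are hypothetical and serve only to illustrate the idea.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write a string to an already-open descriptor.
 * Returns 0 if every byte was written, -1 otherwise. */
int write_string(int fd, const char *s)
{
    size_t len = strlen(s);
    return write(fd, s, len) == (ssize_t)len ? 0 : -1;
}

/* Returns 1 if fd currently refers to an open file, 0 otherwise. */
int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}
```

With these helpers, write_string(1, "done\n") sends text to standard output and write_string(2, "failed\n") sends it to standard error output, with no call to open(2) needed for either descriptor.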

file system A UNIX System file system is a hierarchical collection 

of directories and other files that are organized in a 
tree structure. The base of the structure is the root (/) 
directory; other directories, all subordinate to the root, 
are branches. The collection of files can be mounted 
on a block special file. Each file of a file system 
appears exactly once in the inode list of the file sys- 
tem and is accessible via a single, unique path from 
the root directory of the file system.

filter A filter is a program that reads information from
standard input, acts on it in some way, and sends its
results to standard output. It is called a filter because
it can be used as a data transformer in a pipeline.
Filters are different from editors and other commands
because filters do not change the contents of a file.
Examples of filters are grep(1) and tail(1), which
select and output part of the input; sort(1), which
sorts the input; and wc(1), which counts the number
of words, characters, and lines in the input. sed(1)
and awk(1) are also filters, but they are called programmable
filters or data transformers because, in
addition to the data to be transformed, a program
must be supplied as input.
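
To make the idea concrete, here is a minimal sketch of the working part of a filter written in C. It copies its input to its output and converts lowercase letters to uppercase on the way. The function name upcase_filter is hypothetical.

```c
#include <ctype.h>
#include <stdio.h>

/* Hypothetical filter body: copy `in` to `out`, converting lowercase
 * letters to uppercase.  Returns the number of characters copied. */
long upcase_filter(FILE *in, FILE *out)
{
    long count = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        putc(toupper(c), out);
        count++;
    }
    return count;
}
```

A complete filter would need only a main() that calls upcase_filter(stdin, stdout). Because the routine touches nothing but its two streams, the resulting program can be placed anywhere in a pipeline.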
flag A flag or option is used on a command line to signal a
specific condition to a command or to request particular
processing. UNIX System flags are usually indicated
by a leading hyphen (-). The word option is
sometimes used interchangeably with flag. Flag is 
also used as a verb to mean "to point out" or "to 
draw attention to". See option. 

fork fork(2) is a system call that divides a process into
two processes, the parent process and the child
process, with separate, but initially identical, text,
data, and stack segments. After duplication, the child
(created) process is given a return code of 0 and the
parent process is given the process id of the newly
created child as the return code.
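
The two return values can be seen in the sketch below, in which the child exits at once and the parent collects its exit status. The helper name child_exit_status is hypothetical, and the waitpid() call is assumed to be available, as it is on POSIX systems.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with `code`, then
 * wait for it in the parent.  Returns the child's exit status, or -1 if
 * fork or wait fails or the child did not exit normally. */
int child_exit_status(int code)
{
    int status;
    pid_t pid = fork();

    if (pid == -1)
        return -1;                 /* fork failed; no child was created */
    if (pid == 0)
        _exit(code);               /* child: fork returned 0 */

    /* parent: fork returned the child's process id */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child calls _exit() rather than exit() so that it does not flush standard I/O buffers that it inherited from the parent.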

FORTRAN FORTRAN is an acronym for FORmula TRANslator.
It is a high-level programming language originally
designed for scientific and engineering calculations but
is now widely adapted for many business uses also.

function A function is a task done by a computer. In most
modern programming languages, programs are made
up of functions and procedures which perform small
parts of the total job to be done.

header file A header file is used in programming and in document
formatting. In a programming context, a header
file is a file that usually contains shared data declarations
that are to be copied into source programs as
they are compiled. A header file includes symbolic
names for constants, macro definitions, external variable
references and inclusion of other header files.
The name of a header file customarily ends with '.h'
(dot-h). Similarly, in a document formatting context,
header files contain general formatting macros that
describe a common document type and can be used
with many different document bodies.

high-level language A high-level language is a computer programming
language such as C, FORTRAN, COBOL, or PASCAL
that uses symbols and command statements
representing actions the computer is to perform, the
exact steps for a machine to follow. A high-level
language must be translated into machine language by
a compilation system before a computer can execute it.
A characteristic of a high-level language is that each
statement usually translates into a series of machine
language instructions. Low-level details of the
computer's internal organization are left to the compilation
system.

host machine A host machine is the machine on which an a.out file 

is built. 

imported symbol An imported symbol is a symbol used but not defined 

by a shared library. See exported symbol. 

interpreted language An interpreted language is a high-level language that
is neither translated by a compilation system nor
stored in an executable object file. The statements of
a program in an interpreted language are translated
each time the program is executed.

Interprocess Communication 

Interprocess Communication describes software that 
enables independent processes running at the same 
time to exchange information through messages, 
semaphores, or shared memory. 

interrupt An interrupt is a break in the normal flow of a system 

or program. Interrupts are initiated by signals that are 
generated by a hardware condition or a peripheral 
device indicating that a certain event has happened. 
When the interrupt is recognized by the hardware, an 
interrupt handling routine is executed. An interrupt 
character is a character (normally ASCII) that, when 
typed on a terminal, causes an interrupt. You can 
usually interrupt UNIX System programs by pressing 
the delete or break keys, by typing CTRL-D, or by 
using the kill(1) command.
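
A program can arrange to be told about an interrupt instead of being terminated by it. The sketch below installs a handler for the interrupt signal, SIGINT, with signal(2) and then sends that signal to itself so that the effect can be observed. The names on_interrupt and catch_one_interrupt are hypothetical.

```c
#include <signal.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
static volatile sig_atomic_t interrupted = 0;

static void on_interrupt(int signo)
{
    (void)signo;
    interrupted = 1;
}

/* Hypothetical helper: install the handler for SIGINT, deliver SIGINT to
 * this process, and report whether the handler ran (1) or not (0). */
int catch_one_interrupt(void)
{
    interrupted = 0;
    if (signal(SIGINT, on_interrupt) == SIG_ERR)
        return 0;
    raise(SIGINT);                 /* stands in for the user's key press */
    signal(SIGINT, SIG_DFL);       /* restore the default action */
    return interrupted == 1;
}
```

In a real program the signal would come from the terminal or from the kill command rather than from raise(). The handler does nothing more than set a flag, which the main flow of the program tests at a convenient point.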

I/O (Input/Output) I/O is the process by which information enters (input) 

and leaves (output) the computer system. 
kernel The kernel (comprising 5 to 10 percent of the operating
system software) is the basic resident software on
which the UNIX System relies. It is responsible for 
most operating system functions. It schedules and 
manages work done by the computer and maintains 
the file system. The kernel has its own text, data, and 
stack areas. 

lexical analysis Lexical analysis is the process by which a stream of
characters (often comprising a source program) is subdivided
into its elementary words and symbols (called
tokens). The tokens include the reserved words of the
language, its identifiers and constants, and special
symbols such as =, :=, and ;. Lexical analysis enables
you to recognize, for example, that the stream of characters
'print("hello, universe")' is to be analyzed into
a series of tokens beginning with the word 'print' [not
with the string 'print("h']. In compilers, a lexical
analyzer is often called by the compiler's syntactic
analyzer or parser, which determines the statements
of the program (that is, the proper arrangements of its
tokens).
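
The sketch below is a very small lexical analyzer that breaks the stream from the example above into tokens. It is an illustration under simplifying assumptions, not the analyzer of any particular compiler: it recognizes only words, quoted strings, and single-character symbols. The names tok_kind and next_token are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch only. */
enum tok_kind { TOK_END, TOK_WORD, TOK_STRING, TOK_SYMBOL };

/* Scan one token starting at *src.  On return, *start points at the
 * token's first character, *len holds its length, and *src has been
 * advanced past it.  Blanks between tokens are skipped. */
enum tok_kind next_token(const char **src, const char **start, size_t *len)
{
    const char *p = *src;
    enum tok_kind kind;

    while (isspace((unsigned char)*p))
        p++;
    *start = p;

    if (*p == '\0') {
        kind = TOK_END;
    } else if (isalpha((unsigned char)*p) || *p == '_') {
        while (isalnum((unsigned char)*p) || *p == '_')
            p++;
        kind = TOK_WORD;
    } else if (*p == '"') {
        p++;                               /* opening quote */
        while (*p != '\0' && *p != '"')
            p++;
        if (*p == '"')
            p++;                           /* closing quote */
        kind = TOK_STRING;
    } else {
        p++;                               /* any other single character */
        kind = TOK_SYMBOL;
    }

    *len = (size_t)(p - *start);
    *src = p;
    return kind;
}
```

Called repeatedly on the stream print("hello, universe"), the routine yields the word print, the symbol (, the quoted string, and the symbol ), in that order. A parser would call such a routine each time it needs the next token.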

library A library is an archive file that contains object code 

and/or files for programs that perform common tasks. 
The library provides a common source for object code, 
thus saving space by providing one copy of the code 
instead of requiring every program that wants to 
incorporate the functions in the code to have its own 
copy. The link editor may select functions and data 
as needed. 
link editor A link editor, or loader, collects and merges separately
compiled object files by linking together object files
and the libraries that are referenced into executable
load modules. The result is an a.out file. Link editing
may be done automatically when you use the compilation
system to process your programs on the UNIX
System. You can also link edit previously compiled
files by using the ld(1) command.

magic number The magic number is contained in the header of an
a.out file. It indicates what the type of the file is,
whether it is made up of shared or non-shared text,
and on which processor the file is executable.

makefile A makefile is a file that lists dependencies among the
source-code files of a software product and methods
for updating them, usually by recompilation. The
make(1) command uses the makefile to maintain
self-consistent software.

manual page A manual page, or "man page" in UNIX System jargon,
is the repository for the detailed description of a
command, a system call, a subroutine, or some other
UNIX System component.

null pointer A null pointer is a C pointer with a value of 0.

object code Object code is executable machine-language code produced
from source code or from other object files by
an assembler or a compilation system. An object file
is a file of object code and associated data. An object
file that is ready to run is an executable file.

optimizer An optimizer, an optional step in the compilation process,
improves the efficiency of the assembly language
code. The optimizer reduces the space used by, and
speeds the execution time of, the code.
option An option is an argument used in a command line to
modify program output by modifying the execution of
a command. It is usually one character preceded by a
hyphen (-). When you do not specify any options, the
command will execute according to its default options.
For example, in the command line

ls -a -l directory

-a and -l are the options that modify the ls(1) command
to list all directory entries, including entries
whose names begin with a period (.), in the long format
(including permissions, size, and date).

parent process A parent process occurs when a process is split into
two, a parent process and a child process, with
separate, but initially identical text, data, and stack
segments.

parse To parse is to analyze a sentence in order to identify
its components and to determine their grammatical
relationship. In computer terminology the word has a
similar meaning, but instead of sentences, program
statements or commands are analyzed.

PASCAL PASCAL is a multipurpose high-level programming
language often used to teach programming. It is
based on the ALGOL programming language and
emphasizes structured programming.

path name A path name is a way of designating the exact location
of a file in a file system. It is made up of a series
of directory names that proceed down the hierarchical
path of the file system. The directory names are
separated by a slash character (/). The last name in
the path is either a file or another directory. If the
path name begins with a slash, it is called a full path
name; the initial slash means that the path begins at
the root directory.

A path name that does not begin with a slash is
known as a relative path name, meaning relative to
the present working directory. A relative path name
may begin either with a directory name or with two
dots followed by a slash (../). One that begins with a
directory name indicates that the ultimate file or directory
is below the present working directory in the
hierarchy. One that begins with ../ indicates that the
path first proceeds up the hierarchy; ../ is the parent
of the present working directory.
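
The rules for telling a full path name from a relative one are simple enough to state in a few lines of C. The sketch below is only an illustration; the function names are hypothetical.

```c
#include <string.h>

/* Returns 1 if `path` is a full path name (begins at the root directory),
 * 0 if it is a relative path name or empty. */
int is_full_path(const char *path)
{
    return path[0] == '/';
}

/* Returns 1 if a relative path first moves up to the parent directory. */
int starts_at_parent(const char *path)
{
    return strncmp(path, "../", 3) == 0 || strcmp(path, "..") == 0;
}
```

For instance, is_full_path("/usr/bin/cc") is true, while "src/main.c" and "../lib" are both relative, and only the second begins by moving up the hierarchy.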

permissions Permissions are a means of defining a right to access a
file or directory in the UNIX System file system. They
are granted separately to you, the owner of the file or
directory, your group, and all others. There are three
basic permissions:

□ Read permission (r) includes permission to cat,
pg, lp, and cp a file.

□ Write permission (w) is the permission to change
a file.

□ Execute permission (x) is the permission to run an
executable file.

Permissions can be changed with the chmod(1) command
(see the User's/System Administrator's Reference
Manual).
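
Within a program, the permissions of a file are held as bits in the mode reported by stat(2). The following sketch turns those bits into the nine-character form shown by the long listing of ls. The helper name mode_string is hypothetical, and the symbolic constants S_IRUSR and its companions from <sys/stat.h> are assumed to be available.

```c
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: render the nine permission bits of `mode` as the
 * familiar "rwxrwxrwx" string (owner, group, others).  `out` must have
 * room for 10 characters including the terminating null. */
void mode_string(mode_t mode, char out[10])
{
    static const mode_t bits[9] = {
        S_IRUSR, S_IWUSR, S_IXUSR,
        S_IRGRP, S_IWGRP, S_IXGRP,
        S_IROTH, S_IWOTH, S_IXOTH
    };
    static const char letters[] = "rwxrwxrwx";
    int i;

    for (i = 0; i < 9; i++)
        out[i] = (mode & bits[i]) ? letters[i] : '-';
    out[9] = '\0';
}
```

A mode of octal 755, for example, gives read, write, and execute permission to the owner and read and execute permission to the group and to all others.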

pipe A pipe causes the output of one command to be used
as the input for the next command so that the two run
in sequence. You can do this by preceding each command
after the first command with the pipe symbol
( | ), which indicates that the output from the process
on the left should be routed to the process on the
right. For example, in the command

who | wc -l

the output from the who(1) command, which lists the
users who are logged on to the system, is used as
input for the word-count command, wc(1), with the -l
option. The result of this pipeline (succession of commands
connected by pipes) is the number of people
who are currently logged on to the system.
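
The shell builds a pipeline with the pipe(2) system call, which returns a pair of file descriptors: one for writing and one for reading. The sketch below sends a short message through a pipe and reads it back. To keep it short, both ends are used by a single process, which is a simplification; the helper name pipe_roundtrip is hypothetical.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: send `msg` through a pipe and read it back into
 * `buf` (capacity `cap`).  Returns the number of bytes read, or -1 on
 * error.  Real pipelines put the writer and reader in separate processes;
 * one process is used here only to keep the sketch short. */
long pipe_roundtrip(const char *msg, char *buf, size_t cap)
{
    int fds[2];                    /* fds[0] = read end, fds[1] = write end */
    size_t len = strlen(msg);
    ssize_t n;

    if (pipe(fds) == -1)
        return -1;
    if (write(fds[1], msg, len) != (ssize_t)len) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    close(fds[1]);                 /* reader sees end of file after the data */
    n = read(fds[0], buf, cap);
    close(fds[0]);
    return (long)n;
}
```

In a pipeline such as the one above, the shell would instead call fork(2) after pipe(2), so that one process inherits the write end as its standard output and the other inherits the read end as its standard input.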
portability Portability describes the degree of ease with which a
program or a library can be moved or ported from one
system to another. Portability is desirable because 
once a program is developed it is used on many sys- 
tems. If the program writer must change the program 
in many different ways before it can be distributed to 
the other systems, time is wasted, and each modifica- 
tion increases the chances for an error. 

preprocessor Preprocessor is a generic name for a program that
prepares an input file for another program. For example,
neqn and tbl are preprocessors for nroff, grap is
a preprocessor for pic, and cpp(1) is a preprocessor for
the C compiler.

process A process is a program that is at some stage of execu- 

tion. In the UNIX System, it also refers to the execu- 
tion of a computer environment, including contents of 
memory, register values, name of the current working 
directory, status of files, information recorded at login 
time, etc. Every time you type the name of a file that 
contains an executable program, you initiate a new 
process. Shell programs can cause the initiation of 
many processes because they can contain many com- 
mand lines. 

The process id is a unique system-wide identification 
number that identifies an active process. The process 
status command, ps(1), prints the process ids of the 
processes that belong to you. 
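
A program can start a new process with the fork(2) system call and collect its result with a wait call. The sketch below is a minimal illustration; the function name is hypothetical, and waitpid is assumed to be available (older systems provide only wait(2)).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Start a child process that exits with the given code and return
 * the exit status the parent collects, or -1 on error. */
int run_child(int code)
{
    int status;
    pid_t pid = fork();     /* in the parent, pid is the child's process id */

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);        /* child: terminate immediately */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```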

program A program is a sequence of instructions or commands 

that cause the computer to perform a specific task, for 
example, changing text, making a calculation, or 
reporting system status. A subprogram is part of a 
larger program and can be compiled independently. 

regular expression A regular expression is a string of alphanumeric char- 
acters and special characters that describe a character 
string. It is a shorthand way of describing a pattern to 
be searched for in a file. The pattern-matching func- 
tions of ed(1) and grep(1), for example, use regular 
expressions. 
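
As an illustration, a C program can test a string against a regular expression. The sketch below assumes the POSIX regcomp and regexec interface from regex.h, which may not be the interface supplied with every release of the system; the function name is hypothetical.

```c
#include <sys/types.h>
#include <regex.h>
#include <stddef.h>

/* Return 1 if text matches the basic regular expression pattern,
 * 0 if it does not, and -1 if the pattern cannot be compiled. */
int matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, the pattern ^ab*c$ describes an a, any number of b characters, and a c, with nothing before or after.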

routine A routine is a discrete section of a program to accom- 

plish a set of related tasks. 

semaphore In the UNIX System, a semaphore is a sharable short 

unsigned integer maintained through a family of sys- 
tem calls, which include calls for increasing the value 
of the semaphore, for setting its value, and for blocking 
until its value reaches some value. Semaphores are 
part of the UNIX System IPC facility. 
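
The family of calls can be sketched as follows, using semget(2), semctl(2), and semop(2). This is an illustration under stated assumptions: the function name and the union name are hypothetical, and the union is declared here because many systems expect the caller to supply it.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Argument type for semctl; declared by the caller on many systems. */
union semun_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private semaphore, set it to 2, increase it by 1, and
 * return the resulting value, or -1 on error. */
int semaphore_demo(void)
{
    union semun_arg arg;
    struct sembuf up;
    int value = -1;
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    arg.val = 2;
    up.sem_num = 0;         /* first (and only) semaphore in the set */
    up.sem_op = 1;          /* increase the value by one */
    up.sem_flg = 0;
    if (semctl(id, 0, SETVAL, arg) == 0 && semop(id, &up, 1) == 0)
        value = semctl(id, 0, GETVAL);
    semctl(id, 0, IPC_RMID);    /* remove the semaphore set */
    return value;
}
```

A negative sem_op would instead block the caller until the value is large enough to be decreased.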

shared library Shared libraries include object modules that may be 

shared among several processes at execution time. 

shared memory Shared memory is an IPC (interprocess communica- 

tion) facility in which two or more processes can share 
the same data space. 
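
A minimal sketch of two processes sharing one integer follows, using shmget(2), shmat(2), and fork(2). The function name is hypothetical, and waiting for the child to exit is used here in place of real synchronization.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Let a child process store value in a shared memory segment and
 * return what the parent then reads from it, or -1 on error. */
int shared_memory_demo(int value)
{
    int result = -1;
    int *cell;
    pid_t pid;
    int id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    cell = (int *)shmat(id, NULL, 0);
    if (cell == (int *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }
    *cell = 0;
    pid = fork();           /* the child inherits the attached segment */
    if (pid == 0) {
        *cell = value;      /* child writes into the shared data space */
        _exit(0);
    }
    if (pid > 0 && waitpid(pid, NULL, 0) == pid)
        result = *cell;     /* parent sees the child's write */
    shmdt(cell);
    shmctl(id, IPC_RMID, NULL);
    return result;
}
```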

shell The shell is the UNIX System program, sh(1), 

responsible for handling all interaction between you 
and the system. It is a command language interpreter 
that understands your commands and causes the com- 
puter to act on them. The shell also establishes the 
environment at your terminal. A shell normally is 
started for you as part of the login process. Three 
shells, the Bourne shell, the Korn shell, and the C 
shell, are popular. The shell can also be used as a 
programming language to write procedures for a 
variety of tasks. 

signal: signal number 

A signal is a message that you send to processes or 
that processes send to one another. The most common 
signals you might send to a process are ones that 
would cause the process to stop: for example, inter- 
rupt, quit, or kill. A signal sent by a running process 
is usually a sign of an exceptional occurrence that has 
caused the process to terminate or divert from the 
normal flow of control. 
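
A process can arrange to catch a signal rather than be stopped by it. The sketch below installs a handler with signal(2) and then sends the interrupt signal to itself; the function and variable names are hypothetical.

```c
#include <signal.h>

/* Number of the most recent signal caught by the handler. */
static volatile sig_atomic_t last_signal = 0;

static void remember_signal(int signo)
{
    last_signal = signo;
}

/* Catch one interrupt signal sent by the process to itself and
 * return its signal number, or -1 if no handler could be set. */
int catch_interrupt(void)
{
    void (*old)(int) = signal(SIGINT, remember_signal);

    if (old == SIG_ERR)
        return -1;
    last_signal = 0;
    raise(SIGINT);          /* the handler runs before raise returns */
    signal(SIGINT, old);    /* restore the previous disposition */
    return (int)last_signal;
}
```

Without the handler, the default action for the interrupt signal would terminate the process.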
Source code is the programming-language version of a 
program. Before the computer can execute the pro- 
gram, the source code must be translated to machine 
language by a compilation system or an interpreter. 

Standard error is an output stream from a program. It 
is normally used to convey error messages. In the 
UNIX System, the default case is to associate standard 
error with the user's terminal. 

Standard input is an input stream to a program. In 
the UNIX System, the default case is to associate 
standard input with the user's terminal. 

Standard output is an output stream from a program. 
In the UNIX System, the default case is to associate 
standard output with the user's terminal. 

-output 

stdio(3S) is a collection of functions for formatted and 
character-by-character input/output at a higher level 
than the basic read, write, and open operations. 
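
The two styles of stdio use, formatted and character-by-character, can be sketched together. In this illustration the function name and the text written are hypothetical; a temporary file stands in for a real data file.

```c
#include <stdio.h>

/* Write one formatted line to a temporary stream, then read it back
 * one character at a time into out (at most size - 1 characters).
 * Returns the number of characters read, or -1 on error. */
int stdio_round_trip(char *out, int size)
{
    FILE *fp;
    int c;
    int n = 0;

    if (size < 1)
        return -1;
    fp = tmpfile();
    if (fp == NULL)
        return -1;
    fprintf(fp, "%s has %d users\n", "host", 3);   /* formatted output */
    rewind(fp);
    while (n < size - 1 && (c = getc(fp)) != EOF)  /* character input */
        out[n++] = (char)c;
    out[n] = '\0';
    fclose(fp);
    return n;
}
```

The buffering is handled by the package, so the program never calls read or write directly.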

Static linking refers to the requirement that symbolic 
references be resolved before run time. See dynamic 
linking. 

□ A stream is an open file with buffering provided 
by the stdio package. 

□ A stream is a full-duplex processing and data 
transfer path in the kernel. It implements a con- 
nection between a driver in kernel space and a 
process in user space, providing a general charac- 
ter input/output interface for user processes. 

A string is a contiguous sequence of characters treated 
as a unit. Strings are normally bounded by white 
space(s), tab(s), or a character designated as a separa- 
tor. A string value is a specified group of characters 
symbolized to the shell by a variable. 

strip(1) is a command that removes the symbol table 
and relocation bits from an executable file. 
A subroutine is a program that defines desired opera- 
tions and may be used in another program to produce 
the desired operations. A subroutine can be arranged 
so that control may be transferred to it from a master 
routine and so that, at the conclusion of the subrou- 
tine, control reverts to the master routine. Such a 
subroutine is usually called a closed subroutine. A 
single routine may be simultaneously a subroutine 
with respect to another routine and a master routine 
with respect to a third. 

A symbol table describes information in an object file 
about the names and functions in that file. The sym- 
bol table and relocation bits are used by the link edi- 
tor and by the debuggers. 

The value of a symbol, typically its virtual address, is 
used to resolve references. 

□ Command syntax is the order in which command 
names, options, option arguments, and operands 
are put together to form a command line. The 
command name is first, followed by options and 
operands. The order of the options and the 
operands varies from command to command. 

□ Language syntax is the set of rules that describes 
how the elements of a programming language 
may legally be used. 

A system call is a request by an active process for a 
service performed by the UNIX System kernel, such as 
I/O, process creation, etc. All system operations are 
allocated, initiated, monitored, manipulated, and ter- 
minated through system calls. System calls allow you 
to request the operating system to do some work that 
the program would not normally be able to do. For 
example, the getuid(2) system call allows you to 
inspect information that is not normally available, 
since it resides in the operating system's address 
space. 
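
A call to getuid(2) from C looks like an ordinary function call, although the work is done by the kernel. The wrapper below is a hypothetical illustration.

```c
#include <sys/types.h>
#include <unistd.h>

/* Ask the kernel for the real user id of the calling process. */
long real_user_id(void)
{
    return (long)getuid();
}
```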






target machine A target machine is the machine on which an a.out 

file is run. While it may be the same machine on 
which the a.out file was produced, the term implies 
that it may be a different machine. 

TCP/IP (Transmission Control Protocol/Internetwork Protocol) 

TCP/IP is a connection-oriented, end-to-end reliable 
protocol designed to fit into a layered hierarchy of 
protocols that support multi-network applications. It 
is the Department of Defense standard in packet net- 
works. 

terminal definition 

A terminal definition is an entry in the terminfo(4) 
database that describes the characteristics of a termi- 
nal. See terminfo(4) and curses(3X) in the 
Programmer's Reference Manual. 

terminfo 

□ terminfo is a group of routines within the curses 
library that handle certain terminal capabilities. 
For example, if your terminal has programmable 
function keys, you can use these routines to pro- 
gram the keys. 

□ terminfo is a database containing the compiled 
descriptions of many terminals that can be used 
with curses(3X) screen management programs. 
These descriptions specify the capabilities of a ter- 
minal and how it performs various operations (for 
example, how many lines and columns it has and 
how its control characters are interpreted). A 
curses(3X) program refers to the database at run 
time to obtain information it needs about the ter- 
minal being used. 

See curses(3X) in the Programmer's Reference Manual. 
terminfo(4) routines can be used in shell programs, as 
well as C programs. 

text symbol A text symbol is a symbol, usually a function name, 

that is defined in the .text portion of an a.out file. 

A tool is a program, or package of programs, that per- 
forms a given task. 

A trap is a condition caused by an error where a pro- 
cess state transition occurs and a signal is sent to the 
currently running process. 

UNIX Operating System 

The UNIX Operating System is a general-purpose, 
multiuser, interactive, time-sharing operating system 
developed by AT&T. An operating system is the 
software on the computer under which all other 
software runs. The UNIX Operating System has two 
basic parts: 

□ The kernel is the program that is responsible for 
most operating system functions. It schedules 
and manages all the work done by the computer 
and maintains the file system. It is always run- 
ning and is invisible to users. 

□ The shell is the program responsible for handling 
all interaction between users and the computer. 
It includes a powerful command language called 
shell language. 

The utility programs or UNIX System commands are 
executed using the shell, and allow users to communi- 
cate with each other, edit and manipulate files, and 
write and execute programs in several programming 
languages. 

A userid is an integer value, usually associated with a 
login name, used by the system to identify owners of 
files and directories. The userid of a process becomes 
the owner of files created by the process and descen- 
dent (forked) processes. 
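
The ownership rule can be observed from a program. The sketch below creates a scratch file and compares its owner with the effective userid of the process; the function name and file name template are hypothetical, and a local file system is assumed.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a scratch file and report whether its owner is the
 * effective userid of this process: 1 if so, 0 if not, -1 on error. */
int new_file_owned_by_process(void)
{
    char path[] = "/tmp/ownerdemoXXXXXX";
    struct stat st;
    int owned = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) == 0)
        owned = (st.st_uid == geteuid());
    close(fd);
    unlink(path);           /* remove the scratch file */
    return owned;
}
```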

A utility is a standard, permanently available program 
used to perform routine functions or to assist a pro- 
grammer in the diagnosis of hardware and software 
errors, for example, a loader, editor, debugging, or 
diagnostics package. 
□ A variable in a computer program is an object 
whose value may change during the execution of 
the program, or from one execution to the next. 

□ A variable in the shell is a name representing a 
string of characters (a string value). 

□ A variable normally set only on a command line 
is called a parameter (positional parameter and 
keyword parameter). 

□ A variable may be simply a name to which the 
user (user-defined variable) or the shell itself may 
assign string values. 

White space is one or more spaces, tabs, or newline 
characters. White space is normally used to separate 
strings of characters and is required to separate the 
command from its arguments on a command line. 
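
As a sketch of how white space separates strings, the hypothetical function below counts the strings on a line. It uses isspace from ctype.h, which also treats carriage return, vertical tab, and form feed as white space.

```c
#include <ctype.h>

/* Count the strings on a line that are separated by white space. */
int count_fields(const char *line)
{
    int fields = 0;
    int in_field = 0;       /* nonzero while inside a string */

    for (; *line != '\0'; line++) {
        if (isspace((unsigned char)*line)) {
            in_field = 0;
        } else if (!in_field) {
            in_field = 1;   /* first character of a new string */
            fields++;
        }
    }
    return fields;
}
```

Any run of spaces, tabs, or newlines counts as a single separator.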

A window is a screen within your terminal screen that 
is set off from the rest of the screen. If you have two 
windows on your screen, they are independent of 
each other and the rest of the screen. 

The most common way to create windows on a UNIX 
System is by using the layers capability of the 
TELETYPE 5620 Dot-Mapped Display. Each window 
you create with this program has a separate shell run- 
ning it. Each one of these shells is called a layer. 

If you do not have this facility, the shl(1) command, 
which stands for shell layer, offers a function similar 
to the layers program. You cannot create windows 
using shl(1), but you can start different shells that are 
independent of each other. Each of the shells you 
create with shl(1) is called a layer. 

A word is a unit of storage in a computer that is com- 
posed of bytes of information. The number of bytes 
in a word depends on the computer you are using. 
The 80286 Computer has 16 bits or 2 bytes per word. 
The 80386 Computer has 32 bits or 4 bytes per word, 
and 16 bits or 2 bytes per half word. 
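
A C program can report sizes for the machine on which it is compiled. The sketch below assumes that the int type reflects the word size, which is common but not guaranteed; the function name is hypothetical.

```c
#include <limits.h>

/* Number of bits in an int on the machine this is compiled for.
 * CHAR_BIT is the number of bits in a byte. */
int bits_per_int(void)
{
    return (int)(sizeof(int) * CHAR_BIT);
}
```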